Codebook

Column definitions for the public release dataset. 71 columns across 13 groups.

Provenance

Every column is tagged with one of four provenance flags so you can tell at a glance how the value was produced.

scraped · 8 columns · Direct from publisher HTML
external · 38 columns · From authoritative API (ROR, OpenAlex, DOAJ, COPE, NPI)
inferred · 7 columns · Algorithmic / ML inference; may carry bias
computed · 18 columns · Derived arithmetically from other columns

Inferred columns (especially gender) carry systematic bias against non-Latin-script names — see the methodology page for coverage details.

Core (11 columns)

Column Display name Type Source Description
publisher Publisher string scraped Publisher name (e.g., Elsevier, Springer Nature, Wiley) — known from the scraper configuration for each source.
journal Journal string scraped Journal title as listed on the publisher website.
editor Name string scraped Editor full name as it appeared on the editorial board page, after mojibake repair and light cleaning.
first_name First name string computed Given name extracted from `editor` via the probablepeople ML name parser. Used as the lookup key for WGND 2.0 gender inference; preserved in the public release so users can re-use the parsed value without rerunning probablepeople.
last_name Last name string computed Surname extracted from `editor` via the probablepeople ML name parser. Empty when probablepeople could not parse the string; the original `editor` field is always preserved.
role Role (raw) string scraped Role label as listed on the publisher page, unmodified.
role_std Role (standardized) string computed Normalised to one of editor_in_chief, associate_editor, section_editor, reviewing_editor, editorial_board_member, deputy_editor, guest_editor, or other. Mapping is deterministic from the raw role string; 'other' absorbs ~6% of positions that don't match the canonical set.
affiliation Affiliation string scraped Institutional affiliation as listed, after universal cleaning (strips roles, credentials, dates, junk). Not canonicalised — see ror_name for the authoritative form.
orcid ORCID string scraped 16-character ORCID iD (15 digits plus a check character, which is 0–9 or X). The scraped value is preferred when available; otherwise the iD is backfilled from OpenAlex or the ORCID API (see orcid_source).
source_url Source URL string scraped URL of the publisher page from which this record was scraped.
scraped_at Scraped at datetime scraped ISO timestamp of when this record was scraped.
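
Because `orcid` values can come from three different sources, a checksum test is a cheap sanity check before joining on the column. The sketch below applies the ISO 7064 MOD 11-2 check that ORCID iDs use. It assumes the value may be stored bare, hyphenated, or as an orcid.org URL; the codebook does not say which form the release uses, and the function names are illustrative.

```python
import re


def orcid_check_char(base15: str) -> str:
    """Return the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base15:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(value: str) -> bool:
    """Check the format and checksum of an ORCID iD."""
    # Assumption: strip an optional URL prefix and hyphens before checking.
    compact = re.sub(r"^https?://orcid\.org/", "", value.strip())
    compact = compact.replace("-", "").upper()
    if not re.fullmatch(r"\d{15}[\dX]", compact):
        return False
    return orcid_check_char(compact[:15]) == compact[15]
```

A row that fails this check most likely has a transcription or scraping error in `orcid`, whatever `orcid_source` says.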

Gender (5 columns)

Column Display name Type Source Description
gender Gender string inferred Country-aware inference from the editor's first name and ror_country using WGND 2.0 (Raffo & Lax-Martinez, WIPO 2021; Harvard Dataverse DOI 10.7910/DVN/MSEGSJ — ~3.5M names across 195 countries). The same name can resolve to different genders in different countries (e.g. 'Andrea' is male in Italy and female in the US). gender-guesser is used as a tertiary fallback for names absent from WGND. Values: male, female, andy (androgynous), unknown. Self-reported gender is NOT available in the dataset.
gender_raw Gender (raw) string inferred Backwards-compatibility column kept since v1; mirrors `gender` (WGND has no `mostly_male`/`mostly_female` granularity).
gender_prob Gender weight float inferred WGND weight in [0, 1] — probability that an individual with this first name in this country has the inferred gender. Replaces the legacy 3-bucket {0.0, 0.75, 1.0} confidence; downstream filters should use thresholds like `>= 0.95`. For gender-guesser fallback rows, 1.0 = certain, 0.75 = mostly_*, 0.0 = unknown.
gender_nobs WGND sample size int inferred Number of observed individuals in the WGND cell (name × country) underlying the inference. Useful for filtering low-confidence cells in analysis. 0 for gender-guesser-fallback rows.
gender_source Gender source string inferred Provenance of the inference: `wgnd_country` (matched on first_name + ror_country, strongest signal), `wgnd_global` (matched on first_name only, used when country was missing or the country-specific cell was empty), `gender_guesser` (tertiary fallback, country-blind), or `unknown`.
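
The lookup order described by `gender_source` can be sketched as a three-step cascade. This is only an illustration under stated assumptions: the two dictionaries are toy stand-ins for WGND 2.0 with invented weights and counts, `fallback` is a hypothetical callable standing in for gender-guesser, and returning `gender_prob` 0.0 for unresolved names is a guess rather than documented behaviour.

```python
from typing import Callable, Optional, Tuple

# Toy stand-ins for WGND cells: (gender, weight, n_observations).
# Every value here is invented for illustration.
WGND_COUNTRY = {
    ("andrea", "Italy"): ("male", 0.98, 1200),
    ("andrea", "United States"): ("female", 0.96, 800),
}
WGND_GLOBAL = {"andrea": ("female", 0.55, 5000)}

Fallback = Callable[[str], Optional[Tuple[str, float]]]


def infer_gender(first_name: Optional[str], country: Optional[str],
                 fallback: Optional[Fallback] = None) -> dict:
    """Resolve gender via country cell, then global cell, then fallback."""
    key = (first_name or "").strip().lower()

    def row(gender, prob, nobs, source):
        return {"gender": gender, "gender_prob": prob,
                "gender_nobs": nobs, "gender_source": source}

    # 1. Country-specific cell (first_name + ror_country).
    if country and (key, country) in WGND_COUNTRY:
        return row(*WGND_COUNTRY[(key, country)], "wgnd_country")
    # 2. Global cell (first_name only).
    if key in WGND_GLOBAL:
        return row(*WGND_GLOBAL[key], "wgnd_global")
    # 3. Country-blind fallback; gender_nobs is 0 for these rows.
    if fallback is not None:
        guess = fallback(key)
        if guess is not None:
            return row(guess[0], guess[1], 0, "gender_guesser")
    return row("unknown", 0.0, 0, "unknown")
```

With the toy tables, `infer_gender("Andrea", "Italy")` resolves to male and `infer_gender("Andrea", "United States")` to female, mirroring the example in the `gender` description. A downstream filter such as `row["gender_prob"] >= 0.95` would keep both country-level matches and drop the weaker global match.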

Institution (ROR) (8 columns)

Column Display name Type Source Description
ror_id ROR ID string external Research Organization Registry identifier resolved from the raw affiliation string via the ROR /organizations?affiliation= fuzzy-match endpoint. Subject to score+token-overlap guards (see CHANGELOG v2.0.0) to reject low-confidence and blacklisted matches.
ror_name Institution string external Canonical institution name from the ROR v2 record (ror_display name).
ror_country Country string external Country name from ROR/GeoNames for the primary location of the institution.
ror_city City string external City from ROR/GeoNames for the primary location.
ror_state State/Province string external State or province from ROR/GeoNames when available.
org_type Org type string external Organization type from ROR: education, healthcare, government, facility, nonprofit, company, archive, or other. Taken from the first entry in the ROR record's types array.
latitude Latitude float external Geographic latitude of the institution from ROR/GeoNames.
longitude Longitude float external Geographic longitude of the institution from ROR/GeoNames.

Classification (OpenAlex) (4 columns)

Column Display name Type Source Description
scientific_domain Domain string external Broadest classification level from OpenAlex Topics (e.g. Life Sciences, Physical Sciences).
scientific_field Field string external Mid-level OpenAlex field (e.g. Medicine, Engineering, Psychology). Assigned via the editor's own most-published topic, NOT via the journal's scope — so interdisciplinary editors can show a field that differs from the journal's nominal area.
scientific_subfield Subfield string external Narrow OpenAlex subfield classification for the editor's most-published topic.
scientific_topic Topic string external Most granular OpenAlex topic classification.

Journal identifiers (2 columns)

Column Display name Type Source Description
openalex_source_id OpenAlex source ID string external OpenAlex Source identifier for the journal, resolved from the journal title and/or ISSN.
issn_l ISSN-L string external Linking ISSN (the ISSN-L groups print and electronic variants into one identifier) from OpenAlex.

Journal metrics (7 columns)

Column Display name Type Source Description
oa_2yr_mean_citedness Mean citedness (2yr) float external OpenAlex 2-year mean citedness: the mean number of citations received in one year by works the journal published in the two preceding years. An open-data analogue of the Clarivate Journal Impact Factor, computed from the OpenAlex citation graph.
oa_journal_h_index Journal h-index int external Journal-level h-index from OpenAlex.
oa_journal_works_count Journal works count int external Total number of works published in the journal (OpenAlex count).
oa_journal_cited_by_count Journal citations int external Total citations received by the journal (OpenAlex count).
is_in_doaj In DOAJ bool external True if the journal is listed in the Directory of Open Access Journals (DOAJ).
is_oa Open access bool external True if OpenAlex classifies the journal as open access.
oa_impact_quartile OpenAlex citedness quartile (per field) string computed Q1–Q4 computed locally by this project, per OpenAlex scientific field, from oa_2yr_mean_citedness. Q1 = top 25% most cited within the same field. Accounts for different citation norms across disciplines. NOT the Clarivate JIF or Scopus CiteScore quartile — see CHANGELOG v2.0.0 for the rationale.
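
A per-field quartile of the kind described for `oa_impact_quartile` can be sketched with a simple rank-based split. This is not the project's actual code: the input shape (one field per journal), the tie handling, and the exact quartile boundaries are assumptions made for illustration.

```python
from collections import defaultdict


def assign_quartiles(journals):
    """Map journal -> 'Q1'..'Q4' within each field.

    journals: iterable of (journal, field, oa_2yr_mean_citedness).
    Journals with a missing citedness value are left out (assumption).
    """
    by_field = defaultdict(list)
    for name, field, citedness in journals:
        if citedness is not None:
            by_field[field].append((citedness, name))

    quartiles = {}
    for rows in by_field.values():
        rows.sort(reverse=True)  # most cited first
        n = len(rows)
        for rank, (_, name) in enumerate(rows):
            # rank * 4 // n is 0..3, so the label is Q1..Q4.
            quartiles[name] = f"Q{rank * 4 // n + 1}"
    return quartiles
```

Ranking within each field is what lets a journal with low absolute citedness still land in Q1 if its field cites sparsely.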

Editor bibliometrics (5 columns)

Column Display name Type Source Description
h_index h-index int external Author h-index from OpenAlex, looked up by (name, ror_id) pair.
total_publications Publications int external Total number of works by this author in OpenAlex.
total_citations Citations int external Total citations received by this author in OpenAlex.
academic_age Academic age int computed Years since this author's first OpenAlex-indexed publication (current year minus earliest publication year).
orcid_source ORCID source string computed How the ORCID was obtained: scraped (from the publisher page), openalex (backfilled from the OpenAlex author record), or orcid_api (queried directly by name+affiliation).

Indexing (7 columns)

Column Display name Type Source Description
indexed_pubmed PubMed bool external True if the journal is indexed in PubMed/MEDLINE (matched on ISSN against the NLM catalog).
indexed_scopus Scopus bool external True if the journal is indexed in Scopus (matched on ISSN against the Scopus source list).
indexed_wos Web of Science bool external True if the journal is indexed in Web of Science (matched on ISSN against the WoS master journal list).
indexed_doaj DOAJ bool external True if the journal is listed in the Directory of Open Access Journals (DOAJ).
indexed_cope COPE bool external True if the publisher is a member of COPE (Committee on Publication Ethics). Publisher-level flag applied to all of that publisher's journals.
indexed_npi NPI bool external True if the journal appears in the Norwegian Publishing Indicator register (at either level 1 or level 2). See the separate 'Norwegian Publishing Indicator' group below for the level and discipline fields.
indexing_count Index count int computed Sum of the indexed_* flags (PubMed, Scopus, WoS, DOAJ, COPE, NPI). Range 0–6. A rough journal-quality proxy independent of citation metrics. Used as the indexing weight in the experimental 'weighted power' score on the Network page.
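
As a quick check, `indexing_count` can be recomputed from the six flags. The sketch below assumes each row is a dict of real booleans and that a missing or null flag counts as False; the codebook does not state how nulls are handled.

```python
INDEX_FLAGS = (
    "indexed_pubmed", "indexed_scopus", "indexed_wos",
    "indexed_doaj", "indexed_cope", "indexed_npi",
)


def indexing_count(row: dict) -> int:
    """Sum the six indexed_* flags; result is in the range 0-6."""
    # Assumption: None or absent flags are treated as False.
    return sum(bool(row.get(flag)) for flag in INDEX_FLAGS)
```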

Norwegian Publishing Indicator (3 columns)

Column Display name Type Source Description
npi_level NPI level string external Norwegian Publishing Indicator level (1 or 2). Level 2 = top 20% of journals in the NPI register. Scope limited to Nordic-relevant disciplines.
npi_discipline NPI discipline string external Broad discipline in the NPI register.
npi_field NPI field string external Specific field in the NPI register.

Funding (6 columns)

Column Display name Type Source Description
top_funder_1 Top funder 1 string external Most common funding organization for articles in this journal (from the OpenAlex Works funder metadata).
top_funder_1_count Funder 1 count int external Number of funded articles from the top funder.
top_funder_2 Top funder 2 string external Second most common funder.
top_funder_2_count Funder 2 count int external Number of funded articles from the second most common funder.
top_funder_3 Top funder 3 string external Third most common funder.
top_funder_3_count Funder 3 count int external Number of funded articles from the third most common funder.
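
The ranking behind the `top_funder_*` columns can be approximated by tallying funders across a journal's articles. A minimal sketch, assuming each article is given as a list of funder names and that a funder is counted at most once per article; the funder names in the usage note are placeholders.

```python
from collections import Counter


def top_funders(articles, k=3):
    """Return the k most common funders as (name, article_count) pairs.

    articles: iterable of lists of funder names, one list per article.
    """
    counts = Counter()
    for funders in articles:
        counts.update(set(funders))  # count each funder once per article
    return counts.most_common(k)
```

For example, `top_funders([["Funder A", "Funder B"], ["Funder A"], ["Funder A", "Funder C"], ["Funder B"]])` ranks Funder A first with 3 articles, then Funder B with 2, then Funder C with 1.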

Board diversity (6 columns)

Column Display name Type Source Description
board_size Board size int computed Total number of editors on this journal's board (number of distinct editor rows sharing the same journal).
board_pct_female % female float computed Percentage of female editors on this board. Computed against the RESOLVED denominator (male + female), not total editors, so the value is not artificially depressed for boards where many editors have unknown inferred gender.
board_country_count Countries on board int computed Number of distinct ror_country values on the board.
board_country_hhi Country HHI float computed Herfindahl–Hirschman Index of country concentration: the sum of squared country shares on the board. Ranges from 1/n (editors spread evenly across n countries, approaching 0 as n grows) to 1 (all editors from a single country).
board_institution_count Institutions on board int computed Number of distinct ror_id values on the board.
board_mean_h_index Mean board h-index float computed Arithmetic mean of h_index across board members with a resolved OpenAlex profile.
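
The two less obvious board metrics, `board_pct_female` and `board_country_hhi`, can be illustrated in a few lines. This sketch assumes the percentage is on a 0–100 scale and that editors with no resolved country are dropped from the HHI shares; neither detail is stated in the codebook.

```python
from collections import Counter


def board_pct_female(genders):
    """Percent female over the resolved denominator (male + female)."""
    female = sum(g == "female" for g in genders)
    male = sum(g == "male" for g in genders)
    resolved = female + male
    # 'andy' and 'unknown' are excluded from the denominator.
    return 100.0 * female / resolved if resolved else None


def country_hhi(countries):
    """Sum of squared country shares among editors with a known country."""
    known = [c for c in countries if c]  # assumption: skip missing countries
    if not known:
        return None
    n = len(known)
    return sum((count / n) ** 2 for count in Counter(known).values())
```

A board with genders `["female", "male", "male", "unknown"]` comes out at about 33.3% female, not 25%, because the unknown editor is excluded. Four editors in four different countries give an HHI of 0.25, the minimum for n = 4.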

Multi-board (3 columns)

Column Display name Type Source Description
boards_count Boards served int computed Number of distinct editorial boards this editor serves on in the dataset.
publishers_count Publishers served int computed Number of distinct publishers whose journals this editor serves on.
is_multi_board Multi-board bool computed True if boards_count >= 2.
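
The multi-board columns amount to counting distinct values per editor. The sketch below assumes a hypothetical `editor_key` field that identifies the same person across rows (the codebook does not describe how editors are matched across boards) and treats a board as a distinct (publisher, journal) pair.

```python
from collections import defaultdict


def multi_board_stats(rows):
    """Return per-editor boards_count, publishers_count, is_multi_board."""
    boards = defaultdict(set)
    publishers = defaultdict(set)
    for row in rows:
        key = row["editor_key"]  # hypothetical person identifier
        boards[key].add((row["publisher"], row["journal"]))
        publishers[key].add(row["publisher"])
    return {
        key: {
            "boards_count": len(boards[key]),
            "publishers_count": len(publishers[key]),
            "is_multi_board": len(boards[key]) >= 2,
        }
        for key in boards
    }
```

The quality of these counts depends on the matching key: keying on raw names would merge distinct people who share a name.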

Metadata (4 columns)

Column Display name Type Source Description
name_script Script string inferred Detected writing script of the editor's name (Latin, CJK, Cyrillic, Arabic, etc.). Used to document coverage gaps in downstream inference steps.
name_script_region Script region string inferred Geographic region associated with the name script. Heuristic, not authoritative.
data_version Version string computed Dataset version identifier (matches CHANGELOG.md).
enriched_at Enriched at datetime computed ISO timestamp of when enrichment completed for this row.