Codebook
Column definitions for the public release dataset. 71 columns across 13 groups.
Provenance
Every column is tagged with one of four provenance flags so you can tell at a glance how the value was produced. Hover the badge on any row for the full derivation method.
| Flag | Columns | Meaning |
|---|---|---|
| scraped | 8 | Direct from publisher HTML |
| external | 38 | From authoritative API (ROR, OpenAlex, DOAJ, COPE, NPI) |
| inferred | 7 | Algorithmic / ML inference; may carry bias |
| computed | 18 | Derived arithmetically from other columns |
Inferred columns (especially gender) carry systematic bias against non-Latin-script names — see the methodology page for coverage details.
Core (11 columns)
| Column | Display name | Type | Source | Description |
|---|---|---|---|---|
| publisher | Publisher | string | scraped | Publisher name (e.g., Elsevier, Springer Nature, Wiley) — known from the scraper configuration for each source. |
| journal | Journal | string | scraped | Journal title as listed on the publisher website. |
| editor | Name | string | scraped | Editor full name as it appeared on the editorial board page, after mojibake repair and light cleaning. |
| first_name | First name | string | computed | Given name extracted from `editor` via the probablepeople ML name parser. Used as the lookup key for WGND 2.0 gender inference; preserved in the public release so users can re-use the parsed value without rerunning probablepeople. |
| last_name | Last name | string | computed | Surname extracted from `editor` via the probablepeople ML name parser. Empty when probablepeople could not parse the string; the original `editor` field is always preserved. |
| role | Role (raw) | string | scraped | Role label as listed on the publisher page, unmodified. |
| role_std | Role (standardized) | string | computed | Normalised to one of editor_in_chief, associate_editor, section_editor, reviewing_editor, editorial_board_member, deputy_editor, guest_editor, or other. Mapping is deterministic from the raw role string; 'other' absorbs ~6% of positions that don't match the canonical set. |
| affiliation | Affiliation | string | scraped | Institutional affiliation as listed, after universal cleaning (strips roles, credentials, dates, junk). Not canonicalised — see ror_name for the authoritative form. |
| orcid | ORCID | string | scraped | 16-character ORCID iD (15 digits plus a final check character, which may be the letter X). The scraped value is used when available; otherwise the iD is backfilled from OpenAlex or the ORCID API (see orcid_source). |
| source_url | Source URL | string | scraped | Publisher URL where this record was scraped. |
| scraped_at | Scraped at | datetime | scraped | ISO timestamp of when this record was scraped. |
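
Because `orcid` can come from three different routes, a checksum test is a cheap sanity check before joining on it. ORCID iDs end in an ISO 7064 MOD 11-2 check character, and the sketch below verifies it. The helper names are hypothetical, and the code accepts both hyphenated and bare values because the codebook does not say which form is stored.

```python
def orcid_check_char(base15: str) -> str:
    """Return the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base15:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(value: str) -> bool:
    """Hypothetical helper: accepts '0000-0002-1825-0097' or the bare 16 characters."""
    compact = value.replace("-", "").upper()
    base = compact[:15]
    # Require exactly 16 characters, the first 15 being ASCII digits.
    if len(compact) != 16 or not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_char(base) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # ORCID's documented sample iD
```

Rows that fail the check are most likely scrape or parsing damage rather than real identifiers, so they may be worth treating as missing.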
Gender (5 columns)
| Column | Display name | Type | Source | Description |
|---|---|---|---|---|
| gender | Gender | string | inferred | Country-aware inference from the editor's first name and ror_country using WGND 2.0 (Raffo & Lax-Martinez, WIPO 2021; Harvard Dataverse DOI 10.7910/DVN/MSEGSJ — ~3.5M names across 195 countries). The same name can resolve to different genders in different countries (e.g. 'Andrea' is male in Italy and female in the US). gender-guesser is used as a tertiary fallback for names absent from WGND. Values: male, female, andy (androgynous), unknown. Self-reported gender is NOT available in the dataset. |
| gender_raw | Gender (raw) | string | inferred | Backwards-compatibility column kept since v1; mirrors `gender` (WGND has no `mostly_male`/`mostly_female` granularity). |
| gender_prob | Gender weight | float | inferred | WGND weight in [0, 1] — probability that an individual with this first name in this country has the inferred gender. Replaces the legacy 3-bucket {0.0, 0.75, 1.0} confidence; downstream filters should use thresholds like `>= 0.95`. For gender-guesser fallback rows, 1.0 = certain, 0.75 = mostly_*, 0.0 = unknown. |
| gender_nobs | WGND sample size | int | inferred | Number of observed individuals in the WGND cell (name × country) underlying the inference. Useful for filtering low-confidence cells in analysis. 0 for gender-guesser-fallback rows. |
| gender_source | Gender source | string | inferred | Provenance of the inference: `wgnd_country` (matched on first_name + ror_country, strongest signal), `wgnd_global` (matched on first_name only, used when country was missing or the country-specific cell was empty), `gender_guesser` (tertiary fallback, country-blind), or `unknown`. |
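
As a rough illustration of how `gender_prob`, `gender_nobs`, and `gender_source` can be combined in a downstream filter, here is a minimal sketch. The 0.95 threshold is the one suggested above. The minimum sample size of 10, the helper name, and the sample rows are assumptions made up for the example.

```python
# Illustrative rows only; not taken from the dataset.
rows = [
    {"editor": "A", "gender": "female", "gender_prob": 0.99, "gender_nobs": 5400, "gender_source": "wgnd_country"},
    {"editor": "B", "gender": "male", "gender_prob": 0.62, "gender_nobs": 310, "gender_source": "wgnd_global"},
    {"editor": "C", "gender": "male", "gender_prob": 1.0, "gender_nobs": 0, "gender_source": "gender_guesser"},
    {"editor": "D", "gender": "unknown", "gender_prob": 0.0, "gender_nobs": 0, "gender_source": "unknown"},
    {"editor": "E", "gender": "female", "gender_prob": 0.97, "gender_nobs": 3, "gender_source": "wgnd_country"},
]


def is_high_confidence(row, min_prob=0.95, min_nobs=10):
    """Hypothetical filter: keep only resolved, high-weight inferences."""
    if row["gender"] not in ("male", "female"):
        return False
    if row["gender_prob"] < min_prob:
        return False
    # gender_nobs is always 0 for gender-guesser rows, so the sample-size
    # check is applied to WGND-sourced rows only.
    if row["gender_source"].startswith("wgnd") and row["gender_nobs"] < min_nobs:
        return False
    return True


confident = [r["editor"] for r in rows if is_high_confidence(r)]
print(confident)  # ['A', 'C']
```

Whether to keep `gender_guesser` rows at all is a judgment call: they are country-blind, so a stricter analysis might drop them by also filtering on `gender_source`.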
Institution (ROR) (8 columns)
| Column | Display name | Type | Source | Description |
|---|---|---|---|---|
| ror_id | ROR ID | string | external | Research Organization Registry identifier resolved from the raw affiliation string via the ROR /organizations?affiliation= fuzzy-match endpoint. Subject to score+token-overlap guards (see CHANGELOG v2.0.0) to reject low-confidence and blacklisted matches. |
| ror_name | Institution | string | external | Canonical institution name from the ROR v2 record (ror_display name). |
| ror_country | Country | string | external | Country name from ROR/GeoNames for the primary location of the institution. |
| ror_city | City | string | external | City from ROR/GeoNames for the primary location. |
| ror_state | State/Province | string | external | State or province from ROR/GeoNames when available. |
| org_type | Org type | string | external | Organization type from ROR: education, healthcare, government, facility, nonprofit, company, archive, or other. Taken from the first entry in the ROR record's types array. |
| latitude | Latitude | float | external | Geographic latitude of the institution from ROR/GeoNames. |
| longitude | Longitude | float | external | Geographic longitude of the institution from ROR/GeoNames. |
Classification (OpenAlex) (4 columns)
| Column | Display name | Type | Source | Description |
|---|---|---|---|---|
| scientific_domain | Domain | string | external | Broadest classification level from OpenAlex Topics (e.g. Life Sciences, Physical Sciences). |
| scientific_field | Field | string | external | Mid-level OpenAlex field (e.g. Medicine, Engineering, Psychology). Assigned via the editor's own most-published topic, NOT via the journal's scope — so interdisciplinary editors can show a field that differs from the journal's nominal area. |
| scientific_subfield | Subfield | string | external | Narrow OpenAlex subfield classification for the editor's most-published topic. |
| scientific_topic | Topic | string | external | Most granular OpenAlex topic classification. |
Journal identifiers (2 columns)
| Column | Display name | Type | Source | Description |
|---|---|---|---|---|
| openalex_source_id | OpenAlex source ID | string | external | OpenAlex Source identifier for the journal, resolved from the journal title and/or ISSN. |
| issn_l | ISSN-L | string | external | Linking ISSN (the ISSN-L groups print and electronic variants into one identifier) from OpenAlex. |
Journal metrics (7 columns)
| Column | Display name | Type | Source | Description |
|---|---|---|---|---|
| oa_2yr_mean_citedness | Mean citedness (2yr) | float | external | OpenAlex 2-year mean citedness: average citations received by articles published in the last 2 years. An open-data analogue of the Clarivate Journal Impact Factor, computed from the OpenAlex citation graph. |
| oa_journal_h_index | Journal h-index | int | external | Journal-level h-index from OpenAlex. |
| oa_journal_works_count | Journal works count | int | external | Total number of works published in the journal (OpenAlex count). |
| oa_journal_cited_by_count | Journal citations | int | external | Total citations received by the journal (OpenAlex count). |
| is_in_doaj | In DOAJ | bool | external | True if the journal is listed in the Directory of Open Access Journals (DOAJ). |
| is_oa | Open access | bool | external | True if OpenAlex classifies the journal as open access. |
| oa_impact_quartile | OpenAlex citedness quartile (per field) | string | computed | Q1–Q4 computed locally by this project, per OpenAlex scientific field, from oa_2yr_mean_citedness. Q1 = top 25% most cited within the same field. Accounts for different citation norms across disciplines. NOT the Clarivate JIF or Scopus CiteScore quartile — see CHANGELOG v2.0.0 for the rationale. |
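
The per-field quartile logic behind `oa_impact_quartile` can be sketched as follows. This is not the project's implementation: the codebook does not specify how ties or bucket boundaries are handled, so the rank-based rule here (sort descending within a field, then split the ranks into four equal buckets) is an assumption, as are the function name and the sample journals.

```python
from collections import defaultdict


def assign_quartiles(journals):
    """journals: dicts with 'journal', 'field', 'citedness'. Returns {journal: 'Q1'..'Q4'}."""
    by_field = defaultdict(list)
    for j in journals:
        if j["citedness"] is not None:  # journals without a metric get no quartile
            by_field[j["field"]].append(j)

    quartiles = {}
    for group in by_field.values():
        ranked = sorted(group, key=lambda j: j["citedness"], reverse=True)
        n = len(ranked)
        for rank, j in enumerate(ranked, start=1):
            # Integer ceiling of 4 * rank / n: the top quarter of ranks maps to Q1.
            quartiles[j["journal"]] = f"Q{(4 * rank + n - 1) // n}"
    return quartiles


sample = [
    {"journal": "Med A", "field": "Medicine", "citedness": 8.0},
    {"journal": "Med B", "field": "Medicine", "citedness": 4.0},
    {"journal": "Med C", "field": "Medicine", "citedness": 2.0},
    {"journal": "Med D", "field": "Medicine", "citedness": 1.0},
    {"journal": "Math A", "field": "Mathematics", "citedness": 2.0},
    {"journal": "Math B", "field": "Mathematics", "citedness": 1.0},
    {"journal": "Math C", "field": "Mathematics", "citedness": 0.6},
    {"journal": "Math D", "field": "Mathematics", "citedness": 0.3},
]
result = assign_quartiles(sample)
print(result["Med C"], result["Math A"])  # Q3 Q1
```

The two journals printed have the same citedness (2.0) but land in different quartiles, which is the point of ranking within a field rather than across the whole dataset.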
Editor bibliometrics (5 columns)
| Column | Display name | Type | Source | Description |
|---|---|---|---|---|
| h_index | h-index | int | external | Author h-index from OpenAlex, looked up by (name, ror_id) pair. |
| total_publications | Publications | int | external | Total number of works by this author in OpenAlex. |
| total_citations | Citations | int | external | Total citations received by this author in OpenAlex. |
| academic_age | Academic age | int | computed | Years since this author's first OpenAlex-indexed publication (current year minus earliest publication year). |
| orcid_source | ORCID source | string | computed | How the ORCID was obtained: scraped (from the publisher page), openalex (backfilled from the OpenAlex author record), or orcid_api (queried directly by name+affiliation). |
Indexing (7 columns)
| Column | Display name | Type | Source | Description |
|---|---|---|---|---|
| indexed_pubmed | PubMed | bool | external | True if the journal is indexed in PubMed/MEDLINE (matched on ISSN against the NLM catalog). |
| indexed_scopus | Scopus | bool | external | True if the journal is indexed in Scopus (matched on ISSN against the Scopus source list). |
| indexed_wos | Web of Science | bool | external | True if the journal is indexed in Web of Science (matched on ISSN against the WoS master journal list). |
| indexed_doaj | DOAJ | bool | external | True if the journal is listed in the Directory of Open Access Journals (DOAJ). |
| indexed_cope | COPE | bool | external | True if the publisher is a member of COPE (Committee on Publication Ethics). Publisher-level flag applied to all of that publisher's journals. |
| indexed_npi | NPI | bool | external | True if the journal appears in the Norwegian Publishing Indicator register (at either level 1 or level 2). See the separate 'Norwegian Publishing Indicator' group below for the level and discipline fields. |
| indexing_count | Index count | int | computed | Sum of the indexed_* flags (PubMed, Scopus, WoS, DOAJ, COPE, NPI). Range 0–6. A rough journal-quality proxy independent of citation metrics. Used as the indexing weight in the experimental 'weighted power' score on the Network page. |
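
For reference, `indexing_count` can be recomputed from the six flags with a few lines of Python. Treating a missing flag as False is an assumption of this sketch, since the codebook does not say how nulls are handled.

```python
INDEX_FLAGS = (
    "indexed_pubmed",
    "indexed_scopus",
    "indexed_wos",
    "indexed_doaj",
    "indexed_cope",
    "indexed_npi",
)


def indexing_count(row):
    """Sum the six indexed_* flags; missing or None flags count as False (assumption)."""
    return sum(bool(row.get(flag)) for flag in INDEX_FLAGS)


example = {"indexed_pubmed": True, "indexed_scopus": True, "indexed_wos": False,
           "indexed_doaj": False, "indexed_cope": True, "indexed_npi": None}
print(indexing_count(example))  # 3
```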
Norwegian Publishing Indicator (3 columns)
| Column | Display name | Type | Source | Description |
|---|---|---|---|---|
| npi_level | NPI level | string | external | Norwegian Publishing Indicator level (1 or 2). Level 2 = top 20% of journals in the NPI register. Scope limited to Nordic-relevant disciplines. |
| npi_discipline | NPI discipline | string | external | Broad discipline in the NPI register. |
| npi_field | NPI field | string | external | Specific field in the NPI register. |
Funding (6 columns)
| Column | Display name | Type | Source | Description |
|---|---|---|---|---|
| top_funder_1 | Top funder 1 | string | external | Most common funding organization for articles in this journal (from the OpenAlex Works funder metadata). |
| top_funder_1_count | Funder 1 count | int | external | Number of funded articles from the top funder. |
| top_funder_2 | Top funder 2 | string | external | Second most common funding organization for articles in this journal. |
| top_funder_2_count | Funder 2 count | int | external | Number of funded articles from the second most common funder. |
| top_funder_3 | Top funder 3 | string | external | Third most common funding organization for articles in this journal. |
| top_funder_3_count | Funder 3 count | int | external | Number of funded articles from the third most common funder. |
Board diversity (6 columns)
| Column | Display name | Type | Source | Description |
|---|---|---|---|---|
| board_size | Board size | int | computed | Total number of editors on this journal's board (number of distinct editor rows sharing the same journal). |
| board_pct_female | % female | float | computed | Percentage of female editors on this board. Computed against the RESOLVED denominator (male + female), not total editors, so the value is not artificially depressed for boards where many editors have unknown inferred gender. |
| board_country_count | Countries on board | int | computed | Number of distinct ror_country values on the board. |
| board_country_hhi | Country HHI | float | computed | Herfindahl–Hirschman Index of country concentration: the sum of squared country shares on the board. Ranges from 1/N (editors spread evenly across N countries, approaching 0 as N grows) to 1 (all editors from a single country). |
| board_institution_count | Institutions on board | int | computed | Number of distinct ror_id values on the board. |
| board_mean_h_index | Mean board h-index | float | computed | Arithmetic mean of h_index across board members with a resolved OpenAlex profile. |
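
The two less obvious board metrics, `board_pct_female` (resolved denominator) and `board_country_hhi`, can be sketched for a single board as below. The function name and sample board are illustrative. Excluding editors with no `ror_country` from the HHI shares is an assumption, since the codebook does not say how missing countries are treated.

```python
from collections import Counter


def board_metrics(editors):
    """editors: rows for ONE journal, each with 'gender' and 'ror_country'."""
    genders = Counter(e["gender"] for e in editors)
    resolved = genders["male"] + genders["female"]  # 'unknown' and 'andy' are excluded
    pct_female = 100.0 * genders["female"] / resolved if resolved else None

    countries = Counter(e["ror_country"] for e in editors if e["ror_country"])
    total = sum(countries.values())
    hhi = sum((n / total) ** 2 for n in countries.values()) if total else None

    return {
        "board_size": len(editors),
        "board_pct_female": pct_female,
        "board_country_count": len(countries),
        "board_country_hhi": hhi,
    }


board = [
    {"gender": "female", "ror_country": "United States"},
    {"gender": "male", "ror_country": "United States"},
    {"gender": "male", "ror_country": "Italy"},
    {"gender": "unknown", "ror_country": None},
    {"gender": "female", "ror_country": "Japan"},
]
metrics = board_metrics(board)
print(metrics["board_pct_female"], metrics["board_country_hhi"])  # 50.0 0.375
```

In this example the resolved denominator gives 50% female (2 of 4), whereas dividing by all five editors would give 40%.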
Multi-board (3 columns)
| Column | Display name | Type | Source | Description |
|---|---|---|---|---|
| boards_count | Boards served | int | computed | Number of distinct editorial boards this editor serves on in the dataset. |
| publishers_count | Publishers served | int | computed | Number of distinct publishers this editor serves across. |
| is_multi_board | Multi-board | bool | computed | True if boards_count >= 2. |
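
A minimal sketch of how the multi-board columns relate to each other follows. The codebook does not document how the same person is recognised across boards, so grouping on `orcid` is purely an assumption for illustration, as are the function name and the sample rows. Rows without the key are skipped.

```python
from collections import defaultdict


def multi_board_counts(rows, key="orcid"):
    """Group rows by a person key (assumed: orcid) and count distinct journals and publishers."""
    boards = defaultdict(set)
    publishers = defaultdict(set)
    for r in rows:
        person = r.get(key)
        if not person:
            continue  # no usable identity key for this row
        boards[person].add(r["journal"])
        publishers[person].add(r["publisher"])
    return {
        person: {
            "boards_count": len(boards[person]),
            "publishers_count": len(publishers[person]),
            "is_multi_board": len(boards[person]) >= 2,
        }
        for person in boards
    }


rows_mb = [
    {"orcid": "0000-0002-1825-0097", "journal": "Journal One", "publisher": "Publisher P"},
    {"orcid": "0000-0002-1825-0097", "journal": "Journal Two", "publisher": "Publisher P"},
    {"orcid": "0000-0002-1694-233X", "journal": "Journal One", "publisher": "Publisher P"},
    {"orcid": None, "journal": "Journal Two", "publisher": "Publisher P"},
]
counts = multi_board_counts(rows_mb)
print(counts["0000-0002-1825-0097"])
```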
Metadata (4 columns)
| Column | Display name | Type | Source | Description |
|---|---|---|---|---|
| name_script | Script | string | inferred | Detected writing script of the editor's name (Latin, CJK, Cyrillic, Arabic, etc.). Used to document coverage gaps in downstream inference steps. |
| name_script_region | Script region | string | inferred | Geographic region associated with the name script. Heuristic, not authoritative. |
| data_version | Version | string | computed | Dataset version identifier (matches CHANGELOG.md). |
| enriched_at | Enriched at | datetime | computed | ISO timestamp of when enrichment completed for this row. |
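
To give a feel for what a `name_script` value represents, the sketch below guesses the dominant script of a name from Unicode character names in the standard library. This is a heuristic written for illustration, not the project's detection method, which the codebook does not describe. The label mapping and function name are hypothetical.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from Unicode name prefixes to display labels.
LABELS = {"LATIN": "Latin", "CYRILLIC": "Cyrillic", "ARABIC": "Arabic", "CJK": "CJK"}


def detect_script(name: str) -> str:
    """Return the most frequent script among the alphabetic characters of a name."""
    counts = Counter()
    for ch in name:
        if ch.isalpha():
            # The first word of the Unicode character name identifies the script,
            # e.g. 'LATIN SMALL LETTER E WITH ACUTE' -> 'LATIN'.
            prefix = unicodedata.name(ch, "UNKNOWN").split()[0]
            counts[prefix] += 1
    if not counts:
        return "Unknown"
    prefix = counts.most_common(1)[0][0]
    return LABELS.get(prefix, prefix.title())


for sample_name in ("José Müller", "Иван Петров", "李明"):
    print(sample_name, detect_script(sample_name))
```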