Methodology

Overview

Open Editors Plus is a two-stage pipeline that (1) scrapes editorial board listings from publisher websites, then (2) enriches each record with institutional, demographic, and bibliometric metadata from open APIs.

Stage 1: Editorial board scraping

The scraper targets 48 academic publishers and extracts editor names, roles, affiliations, and ORCIDs from publicly available journal editorial board pages. Five complementary scraping strategies are employed depending on publisher infrastructure:

Static HTML

Requests + BeautifulSoup for publishers with server-rendered pages (majority of publishers).

Dynamic rendering

Playwright/Chromium for JavaScript-heavy pages that require browser execution.

REST APIs

Direct API access for publishers with structured endpoints (IEEE, ORCID v3).

Stealth browser (Crawl4AI)

Headless Chromium with anti-detection for Cloudflare-protected sites (Elsevier, AIP).

LLM fallback (Ollama)

Local Qwen 2.5 32B model for heterogeneous layouts where CSS/XPath parsing fails. Used as a last resort.

Data quality measures

  • Encoding repair: Three-layer mojibake correction (ftfy, algorithmic reversal, table-based replacement) for names in non-Latin scripts.
  • Affiliation cleaning: Universal cleaning pipeline strips roles, credentials, dates, and junk from all publisher affiliations.
  • Checkpoint resumption: Completed (publisher, journal) pairs are tracked, enabling interruption-safe runs.
  • Rate limiting: 1.5-4 second delays between requests to respect publisher infrastructure.
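
The checkpoint and rate-limiting measures above can be sketched together as a small scrape loop. This is a minimal illustration, not the project's code: the JSON checkpoint file, the function names (load_done, save_done, run), and the choice of a uniformly random delay within the 1.5 to 4 second range are all assumptions.

```python
import json
import random
import time
from pathlib import Path


def load_done(path):
    """Return the set of (publisher, journal) pairs recorded as finished."""
    p = Path(path)
    if not p.exists():
        return set()
    return {tuple(pair) for pair in json.loads(p.read_text(encoding="utf-8"))}


def save_done(path, done):
    """Persist finished pairs as a sorted JSON list of [publisher, journal]."""
    Path(path).write_text(json.dumps(sorted(done)), encoding="utf-8")


def run(jobs, scrape_one, checkpoint_path,
        min_delay=1.5, max_delay=4.0, sleep=time.sleep):
    """Scrape every pending pair, checkpointing after each success."""
    done = load_done(checkpoint_path)
    results = []
    for publisher, journal in jobs:
        if (publisher, journal) in done:
            continue  # finished in an earlier run, skip
        results.append(scrape_one(publisher, journal))
        done.add((publisher, journal))
        save_done(checkpoint_path, done)             # safe to interrupt after this
        sleep(random.uniform(min_delay, max_delay))  # assumed: random pause in range
    return results
```

Writing the checkpoint after every pair costs a little I/O but means an interrupted run loses at most the journal in flight.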

Stage 2: Enrichment pipeline

Each scraped record passes through a multi-stage enrichment pipeline using open data sources:

Stage | Source | Fields added
Name validation | probablepeople (ML) | Parsed name components
Institution canonicalization | ROR API | ror_id, ror_name, ror_country, org_type, lat/lon
Gender inference (country-aware) | WGND 2.0 (Raffo & Lax-Martinez, WIPO) → gender-guesser fallback | gender, gender_prob, gender_nobs, gender_source
Field classification | OpenAlex API | scientific_domain/field/subfield/topic
Bibliometrics | OpenAlex API | h_index, total_publications, total_citations, academic_age
Journal indexing | PubMed, Scopus, WoS, DOAJ, COPE lists | indexed_pubmed/scopus/wos/doaj/cope
Norwegian index | NPI database | npi_level, npi_discipline, npi_field
Journal metrics | OpenAlex API | journal_h_index, mean_citedness, oa_impact_quartile (OpenAlex citedness quartile per field, not JIF/CiteScore)
Funding sources | OpenAlex API | top_funder_1/2/3
Board-level diversity | Computed | board_size, board_pct_female, country_count, country_hhi
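
The board-level country fields in the last row can be illustrated with a short sketch. It assumes that country_hhi is the standard Herfindahl-Hirschman index (the sum of squared country shares, on a 0 to 1 scale) and that editors without a resolved country are skipped; neither detail is confirmed by this page, and the function name is hypothetical.

```python
from collections import Counter


def board_country_stats(countries):
    """Return (country_count, country_hhi) for one editorial board.

    Assumption: HHI = sum of squared shares, so 1.0 means a single-country
    board and 1/k means k equally represented countries.
    """
    counts = Counter(c for c in countries if c)  # skip empty / None
    total = sum(counts.values())
    if total == 0:
        return 0, None
    hhi = sum((n / total) ** 2 for n in counts.values())
    return len(counts), hhi


# Shares 0.5, 0.25, 0.25 give 0.25 + 0.0625 + 0.0625 = 0.375
print(board_country_stats(["US", "US", "DE", "JP"]))
```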

Role taxonomy

Editorial role labels on publisher pages are a mess: every publisher uses slightly different wording, capitalisation, and ordering (Editor-in-Chief, Chief Editor, EDITOR-IN-CHIEF, Associate Editors, Associate Editorial Board Members, Editorial Board Member(s), and so on). The scraper lowercases the raw string and runs it through a deterministic substring-based normaliser to produce role_std. The full mapping is exactly this, in priority order (first matching rule wins):

Pattern (case-insensitive substring) | role_std | Rows in release
editor-in-chief · editor in chief · chief editor | editor_in_chief | 17,210
associate | associate_editor | 180,781
section | section_editor | 42,940
review | reviewing_editor | 35,194
deputy | deputy_editor | 3,808
board | editorial_board_member | 470,515
guest | guest_editor | 535
editor · editora (generic fallback) | editor | 112,945
anything else | other | 55,810

Priority ordering matters. A raw role like "Associate Editor-in-Chief" matches both the editor-in-chief and the associate rules. The first rule in the table above wins, so that string is classified as editor_in_chief. A raw role like "Section Board Member" matches both section and board; section is higher in the order and wins, so it becomes section_editor (that's why the section_editor bucket has 42,940 rows despite the raw text almost always saying "Board Member").

The normaliser is deterministic and comes directly from normalize_role() in scripts/scrape_editorial_boards_2026.py (lines 559–579). If you re-run the scraper with a different rule set, the role_std column will change but role (the raw, unstandardised string) is always preserved so you can reclassify downstream.
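
For readers who want to reclassify downstream, the priority-ordered table can be re-expressed as a short function. This is a sketch reconstructed from the table alone, not a copy of normalize_role(); the names RULES and normalize_role_sketch are hypothetical.

```python
# Rules in priority order, mirroring the table: first matching rule wins.
RULES = [
    (("editor-in-chief", "editor in chief", "chief editor"), "editor_in_chief"),
    (("associate",), "associate_editor"),
    (("section",), "section_editor"),
    (("review",), "reviewing_editor"),
    (("deputy",), "deputy_editor"),
    (("board",), "editorial_board_member"),
    (("guest",), "guest_editor"),
    (("editor", "editora"), "editor"),  # generic fallback
]


def normalize_role_sketch(raw):
    """Map a raw role string to role_std by case-insensitive substring match."""
    text = (raw or "").lower()
    for patterns, role_std in RULES:
        if any(p in text for p in patterns):
            return role_std
    return "other"


for raw in ("Associate Editor-in-Chief", "Section Board Member",
            "Topical Advisory Panel Member"):
    print(raw, "->", normalize_role_sketch(raw))
```

Keeping the rules in an ordered list makes the priority explicit: reordering two entries is enough to change how overlapping labels such as "Section Board Member" are resolved.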

The other bucket (~6% of rows) is dominated by Topical Advisory Panel Members, scientific/advisory committees, and language-specific variants that don't share any of the eight keywords above. None of the high-stakes quality journals use those labels, so their effect on cross-publisher comparisons is small; the bucket is mostly a catch-all for regional and predatory outlets.

Board diversity indices

Each per-entity page (country, field, publisher, institution) shows a Board diversity panel with up to three per-editor indicators: country diversity, institution-type diversity, and seniority (academic age). All three are computed against the entity's unique editors (composite identity key), not its editorial positions, so a prolific editor who sits on many boards at one publisher contributes their institution type and academic age exactly once.

Normalized Shannon index

Country diversity and institution-type diversity both use the normalized Shannon index, also known as Pielou's evenness (Pielou, 1966). For a categorical distribution with proportions p_1, p_2, …, p_k across k observed categories:

  Shannon entropy:     H = −Σ_{i=1..k} p_i · ln(p_i)    (nats)
  Pielou evenness:     J = H / ln(k)                    ∈ [0, 1]

H is the classical Shannon entropy from information theory (Shannon, 1948); dividing by ln(k) normalises it against the maximum possible entropy for k categories (a uniform distribution), giving a comparable number in [0, 1] regardless of how many categories the entity spans. An entity with J = 1 distributes its editors perfectly evenly across every country or institution type it observes; J = 0 means all editors fall in the same bucket. (With a single observed category, k = 1 and ln(k) = 0, so the ratio is undefined; J is reported as 0 in that case.) A publisher with J = 0.6 spreads editors across its observed categories with moderate unevenness: far from uniform, but not concentrated in a single group.
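
The two formulas translate directly into a few lines of Python. This is a minimal sketch; the function name is hypothetical, and returning 0.0 when only one category is observed is a convention chosen to match the J = 0 reading above.

```python
import math
from collections import Counter


def pielou_evenness(labels):
    """Normalized Shannon index J = H / ln(k) over the observed categories."""
    counts = Counter(labels)
    n = sum(counts.values())
    k = len(counts)
    if k <= 1:
        return 0.0  # single bucket (or no data): treated as J = 0
    h = -sum((c / n) * math.log(c / n) for c in counts.values())  # nats
    return h / math.log(k)


print(pielou_evenness(["US", "DE", "JP", "BR"]))            # uniform over 4 countries
print(round(pielou_evenness(["US", "US", "US", "DE"]), 3))  # 3:1 split -> 0.811
```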

Country diversity — how it's calculated

The aggregator collects the ror_country of every unique editor at this entity into a Counter, computes H over that distribution, and divides by ln(k) where k is the number of distinct countries present. Editors whose institution is unresolved (empty ror_id) are excluded from both the numerator and the denominator.
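
The de-duplication and exclusion rules matter as much as the formula, so here is a sketch of the aggregation step. The row layout (dicts with editor_key, ror_id, ror_country) and the rule of keeping the first resolved country seen for each editor are assumptions for illustration.

```python
import math
from collections import Counter


def country_evenness(positions):
    """Pielou's J over the countries of an entity's unique, resolved editors.

    positions: iterable of dicts, one per editorial position (assumed layout).
    """
    country_by_editor = {}
    for row in positions:
        if not row.get("ror_id"):
            continue  # unresolved institution: left out of the distribution
        # one vote per editor, however many boards they sit on
        country_by_editor.setdefault(row["editor_key"], row["ror_country"])
    counts = Counter(country_by_editor.values())
    n, k = sum(counts.values()), len(counts)
    if k <= 1:
        return 0.0
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(k)


rows = [
    {"editor_key": "e1", "ror_id": "r1", "ror_country": "US"},
    {"editor_key": "e1", "ror_id": "r1", "ror_country": "US"},  # second board
    {"editor_key": "e1", "ror_id": "r1", "ror_country": "US"},  # third board
    {"editor_key": "e2", "ror_id": "r2", "ror_country": "DE"},
    {"editor_key": "e3", "ror_id": "", "ror_country": "FR"},    # unresolved
]
print(country_evenness(rows))  # one US editor, one DE editor
```

Counting positions instead of editors would turn the same rows into a 3:1 split and lower the score, which is the distortion the unique-editor rule avoids.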

Use cases. Compare publishers to see which has the broadest international footprint (MDPI and Frontiers score higher than Elsevier on this measure, reflecting their more geographically dispersed boards). Compare scientific fields to see which are more internationally concentrated. For country detail pages this panel is hidden because the country distribution is trivially a single-country point mass.

Limitations. Dataset-wide publisher bias propagates into the country Shannon index: the 48 publishers this site covers are not uniformly distributed across countries to begin with, so a publisher with a US-dominated journal catalog will score low regardless of its editorial policies. Use it as a relative indicator, not an absolute one.

Institution-type diversity — how it's calculated

Uses the org_type field that ROR assigns to each institution. ROR's taxonomy has seven descriptive values: education (universities, colleges, and schools), healthcare (hospitals, clinics, and medical networks), government (public research agencies, ministries, and labs), facility (shared research infrastructure: telescopes, synchrotrons, HPC centres), nonprofit (scientific societies, NGOs, foundations), company (private-sector research, pharma, biotech), and archive (libraries, museums, long-term archival institutions). ROR also defines a catch-all other value for organizations that fit none of these.

The aggregator counts each unique editor's org_type exactly once per entity and then computes Pielou's J over the resulting Counter. A value of J = 0 means every editor at this entity is classified into the same ROR type — typically "education", which accounts for about 82% of editors in the whole dataset. A value of J = 1 means the entity's editors are perfectly spread across the types that appear at all.

Why this matters scientometrically. Editorial boards drawn exclusively from universities present a specific epistemic stance on what counts as research: they systematically under-represent clinical practitioners (healthcare), public-sector researchers (government labs), applied R&D workers (company), and community-facing disciplines (nonprofit). A medical journal whose board is 98% education has very different biases from one whose board is 60% education + 35% healthcare, even if both have the same gender share and the same mean h-index. The institution-type diversity score makes that pattern legible at a glance.

How to use it. Rank a set of journals or publishers by this column to spot editorial boards that reach outside academia. Conversely, a very low institution-type Shannon at a publisher whose journals claim to serve practitioners is a red flag. Use it alongside the country diversity score: some journals are internationally broad but institutionally monochrome (globally sourced academics) while others are the reverse.

Limitations. ROR org_type is itself a simplification. A university hospital that publishes clinical trials sits at education in ROR even though the relevant institutional identity for editorial work is healthcare. A research institute federated with a university (e.g. Max Planck, CNRS) may be tagged as facility or government depending on ROR's classification. Edge cases push this indicator by a few hundredths in either direction, so interpret differences of 0.05 or less as noise.

Seniority (academic age)

Academic age = current year (2026) − year of each editor's first OpenAlex-indexed publication. The panel reports the median age, the interquartile range (IQR = 75th percentile − 25th percentile), and the mean across unique editors at the entity.
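
The three summary statistics can be reproduced with the standard library. This is a sketch under assumptions: the function name is hypothetical, and the "inclusive" quartile method is one common choice; the site may interpolate percentiles differently.

```python
import statistics


def seniority_summary(first_pub_years, current_year=2026):
    """Median, IQR and mean of academic age across unique editors."""
    ages = sorted(current_year - y for y in first_pub_years if y is not None)
    if len(ages) < 2:
        return None  # too few editors for quartiles
    q1, _, q3 = statistics.quantiles(ages, n=4, method="inclusive")
    return {
        "median": statistics.median(ages),
        "iqr": q3 - q1,
        "mean": statistics.fmean(ages),
    }


# First-publication years 2024, 2020, 2016, 2006, 1996 -> ages 2, 6, 10, 20, 30
print(seniority_summary([2024, 2020, 2016, 2006, 1996]))
```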

A low IQR with a low median signals a board dominated by one career stage, typically early-career editors clustered in a narrow age band. A high IQR means the board spans junior and senior editors together, which is what a healthy generational mix looks like. A high median with a low IQR indicates an entrenched senior board with at most a token early-career presence.

Limitations. OpenAlex indexing is biased toward STEM and toward English-language publishing. Authors who first published in a non-indexed venue appear younger than they are, and entire humanities subfields have lower academic-age figures across the board for reasons that have nothing to do with editorial policy.

References

  • Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal 27, 379–423. Original definition of information entropy.
  • Pielou, E. C. (1966). The measurement of diversity in different types of biological collections. Journal of Theoretical Biology 13, 131–144. doi:10.1016/0022-5193(66)90013-0. Introduced the J = H / ln(k) normalisation used here.
  • Jost, L. (2006). Entropy and diversity. Oikos 113, 363–375. doi:10.1111/j.2006.0030-1299.14714.x. Modern reference on why Shannon is the right functional form for diversity comparisons and how to interpret the normalised value.
  • Wikipedia: Shannon diversity index (accessible introduction with worked examples).

Limitations and disclaimers

Gender inference and coverage bias

Gender is algorithmically inferred from the editor's first name combined with their country of affiliation using WGND 2.0 (the World Gender Name Dictionary by Raffo & Lax-Martinez, WIPO 2021). The country signal disambiguates names whose gender flips across cultures: "Andrea", for example, is male in Italy but female in the United States. Lookup proceeds in three tiers: the country-specific WGND entry, then the global WGND entry, then the legacy gender-guesser library as a tertiary fallback for names absent from WGND. The inferred value is not a self-reported gender identity. gender_prob carries the WGND weight in [0, 1], gender_nobs the WGND sample size, and gender_source the provenance (wgnd_country / wgnd_global / gender_guesser / unknown).
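
The provenance values suggest a simple lookup cascade, sketched below with toy dictionaries. Every table entry and probability here is made up for illustration, and the function is hypothetical; it does not use the real WGND or gender-guesser APIs.

```python
# Toy stand-ins for the real dictionaries (all values are invented).
WGND_COUNTRY = {("andrea", "IT"): ("male", 0.98), ("andrea", "US"): ("female", 0.97)}
WGND_GLOBAL = {"maria": ("female", 0.99)}
GUESSER = {"john": "male"}


def infer_gender(first_name, country_code):
    """Try country-specific WGND, then global WGND, then the legacy fallback."""
    name = first_name.strip().lower()
    if (name, country_code) in WGND_COUNTRY:
        gender, prob = WGND_COUNTRY[(name, country_code)]
        return {"gender": gender, "gender_prob": prob, "gender_source": "wgnd_country"}
    if name in WGND_GLOBAL:
        gender, prob = WGND_GLOBAL[name]
        return {"gender": gender, "gender_prob": prob, "gender_source": "wgnd_global"}
    if name in GUESSER:
        return {"gender": GUESSER[name], "gender_prob": None,
                "gender_source": "gender_guesser"}
    return {"gender": "unknown", "gender_prob": None, "gender_source": "unknown"}


print(infer_gender("Andrea", "IT")["gender"])  # country tier decides
print(infer_gender("Andrea", "US")["gender"])
```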

Crucially, coverage is not uniform across countries. WGND 2.0 covers ~3.5M unique first names across 195 countries, materially improving coverage on Slavic, Arabic, South Asian, and African names compared to the older Latin-script-centric gender-guesser dictionary. The largest gain is on East Asian editorial-board members: classification of editors affiliated to Chinese institutions rose from 8.0% under the v2.6 gender-guesser pipeline to 44.6% under the v2.7 WGND pipeline, and South Korea from 11.6% to 40.2% (after applying the gender_prob ≥ 0.75 confidence floor described below). Transliterated CJK-script names from Taiwan (19.2%) and Hong Kong (59.4%) remain the principal residual gap. Every per-entity page on this site reports pct_gender_classified alongside the female share, and entities below 60% coverage display a prominent warning band: the reported female share is then a fraction of the classifiable minority, not of the whole board.

Concretely in the current release:

  • Italy, Germany, UK, US, France: 80–95% classified — the female share is trustworthy
  • Japan: around 80% classified — reasonably trustworthy
  • China: around 34% classified — two-thirds of Chinese editors' names are unclassified and the displayed female share only covers the 34%
  • South Korea, Taiwan, India: 16–41% classified — treat per-country female shares as indicative at best

When comparing countries, compare the pct_gender_classified values first; gaps in coverage often explain away apparent gender imbalances. The global pct_female headline (currently 33.0% of unique editors with a resolved gender; 82.0% of unique editors are gender-classified) is computed the same way and inherits the same bias — it is a lower bound on the share in well-covered regions and a less meaningful number in poorly-covered ones.
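
A small sketch shows why coverage has to be read before the female share. The function name, the "female" / "male" / "unknown" label values, and the treatment of the 60% warning threshold as a strict lower bound are assumptions.

```python
def gender_coverage(genders, warn_below=60.0):
    """Coverage and female share for one entity's unique editors."""
    total = len(genders)
    classified = [g for g in genders if g in ("female", "male")]
    pct_classified = 100.0 * len(classified) / total if total else 0.0
    pct_female = (100.0 * classified.count("female") / len(classified)
                  if classified else None)
    return {
        "pct_gender_classified": pct_classified,
        "pct_female": pct_female,          # share of the classified subset only
        "low_coverage": pct_classified < warn_below,
    }


# 2 women, 2 men, 6 unclassified: the share reads 50%, but it rests on 4 of 10 editors
board = ["female"] * 2 + ["male"] * 2 + ["unknown"] * 6
print(gender_coverage(board))
```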

Confidence floor: gender_prob ≥ 0.75

v2.7 applies a confidence floor of gender_prob ≥ 0.75 at WGND lookup time. Matches with weights below the floor are demoted to gender = 'unknown', gender_source = 'unknown', while the raw gender_prob and gender_nobs stay populated for transparency. The threshold was set empirically from a 100-row manual validation sample drawn from v2.7 (50 records from gender_source = wgnd_country and 50 from wgnd_global): pre-threshold precision was 97/100 (97.0%), with 3 mis-classifications all on the wgnd_global layer at probabilities 0.50, 0.52 and 0.71. Applying the gender_prob ≥ 0.75 floor demotes 5 of the 100 rows to gender = 'unknown' (those 3 wrong calls plus 2 correct calls whose weights also fall below the floor), leaving the resolved post-threshold subset at 95/95 = 100.0% precision on this sample at a coverage cost of ~5 pp. Researchers who want the raw v2.7 inferences (no floor) can recover them by filtering the master parquet on the unmodified gender_prob column or by calling the lookup with wgnd.annotate(..., min_prob=0.0).
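
The demotion rule and the validation arithmetic can be replayed on a synthetic sample. This is a sketch, not the wgnd module: apply_floor and precision are hypothetical helpers, and the two sub-floor probabilities for the correct-but-demoted rows (0.60 and 0.70) are invented, since the text reports only the three wrong ones.

```python
def apply_floor(record, min_prob=0.75):
    """Demote low-confidence WGND matches; keep gender_prob for transparency."""
    out = dict(record)
    prob = out.get("gender_prob")
    from_wgnd = out.get("gender_source") in ("wgnd_country", "wgnd_global")
    if from_wgnd and prob is not None and prob < min_prob:
        out["gender"] = "unknown"
        out["gender_source"] = "unknown"
    return out


def precision(rows):
    """Return (correct, resolved) counts over rows with a resolved gender."""
    resolved = [r for r in rows if r["gender"] != "unknown"]
    return sum(r["correct"] for r in resolved), len(resolved)


def row(source, prob, correct):
    return {"gender": "female", "gender_source": source,
            "gender_prob": prob, "correct": correct}


# Synthetic stand-in for the 100-row validation sample.
sample = ([row("wgnd_global", p, False) for p in (0.50, 0.52, 0.71)]
          + [row("wgnd_global", p, True) for p in (0.60, 0.70)]  # invented values
          + [row("wgnd_country", 0.95, True) for _ in range(95)])

print(precision(sample))                            # before the floor
print(precision([apply_floor(r) for r in sample]))  # after the floor
```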

Indexing as quality proxy

Indexing columns (indexed_pubmed, indexed_scopus, etc.) reflect database membership status at scraping time. Absence from an index does not necessarily indicate low quality — new, regional, or specialized journals may not yet be indexed.

Affiliation currency

Affiliations were scraped from publisher websites and may not reflect editors' current institutional appointments.

Data sources

This project relies on several open data sources and APIs. We are grateful to the teams behind each of these resources:

Source | Used for | License
OpenAlex | Bibliometrics (h-index, citations), journal classification, journal metrics, funding sources | CC0
ROR (Research Organization Registry) | Institution canonicalization, geolocation (country, city, coordinates) | CC0
WGND 2.0 (Raffo & Lax-Martinez, WIPO) | Country-aware gender inference from first names (~3.5M unique names, 195 countries) | CC0
gender-guesser | Tertiary gender-inference fallback for names absent from WGND | GPL-3.0
Scopus Source List | Journal indexing status (indexed_scopus) | Publicly available list
Web of Science Master Journal List | Journal indexing status (indexed_wos) | Publicly available list
PubMed / NLM Catalog | Journal indexing status (indexed_pubmed) | Public domain
DOAJ (Directory of Open Access Journals) | Open access status, journal indexing | CC-BY-SA
COPE (Committee on Publication Ethics) | Publisher ethics membership | Public list
Norwegian Register for Scientific Journals (NPI/NSD) | Norwegian Publishing Indicator level and classification | Open data
ORCID | Editor unique identifiers | CC0 (public data)
probablepeople | Name parsing and validation | MIT

Acknowledgements

Open Editors Plus builds upon and is inspired by the original Open Editors project by Nishikawa-Pacher, Heck, and Schoch. We are deeply grateful to the teams behind OpenAlex, ROR, DOAJ, ORCID, and all the other open data initiatives that make this work possible. Open science infrastructure is a public good, and this project exists because of their commitment to open data.

Ethics statement

This dataset contains only publicly available information from journal editorial board pages. No private or restricted data sources were used. The scraping respects publisher rate limits and robots.txt directives. Race/ethnicity inference data was computed for internal analysis only and is not included in the public release. Gender inference is provided with full methodology transparency and confidence scores to enable responsible use.