Methodology
Overview
Open Editors Plus is a two-stage pipeline that (1) scrapes editorial board listings from publisher websites, then (2) enriches each record with institutional, demographic, and bibliometric metadata from open APIs.
Stage 1: Editorial board scraping
The scraper targets 48 academic publishers and extracts editor names, roles, affiliations, and ORCIDs from publicly available journal editorial board pages. Five complementary scraping strategies are employed depending on publisher infrastructure:
- Static HTML: Requests + BeautifulSoup for publishers with server-rendered pages (majority of publishers).
- Dynamic rendering: Playwright/Chromium for JavaScript-heavy pages that require browser execution.
- REST APIs: Direct API access for publishers with structured endpoints (IEEE, ORCID v3).
- Stealth browser (Crawl4AI): Headless Chromium with anti-detection for Cloudflare-protected sites (Elsevier, AIP).
- LLM fallback (Ollama): Local Qwen 2.5 32B model for heterogeneous layouts where CSS/XPath parsing fails. Used as a last resort.
Data quality measures
- Encoding repair: Three-layer mojibake correction (ftfy, algorithmic reversal, table-based replacement) for names in non-Latin scripts.
- Affiliation cleaning: Universal cleaning pipeline strips roles, credentials, dates, and junk from all publisher affiliations.
- Checkpoint resumption: Completed (publisher, journal) pairs are tracked, enabling interruption-safe runs.
- Rate limiting: 1.5-4 second delays between requests to respect publisher infrastructure.
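The checkpoint and rate-limiting measures above can be sketched as follows. This is a minimal illustration, not the project's code: the JSON checkpoint format, the function names, and the choice of a uniformly random delay inside the 1.5-4 second window are all assumptions.

```python
import json
import random
import time
from pathlib import Path


def load_done(path: Path) -> set[tuple[str, str]]:
    """Read completed (publisher, journal) pairs; empty set if no checkpoint yet."""
    if not path.exists():
        return set()
    return {tuple(pair) for pair in json.loads(path.read_text(encoding="utf-8"))}


def mark_done(path: Path, done: set[tuple[str, str]], pair: tuple[str, str]) -> None:
    """Record one finished pair and rewrite the (hypothetical) checkpoint file."""
    done.add(pair)
    path.write_text(json.dumps(sorted(done)), encoding="utf-8")


def polite_delay(low: float = 1.5, high: float = 4.0) -> float:
    """Draw a delay in seconds from the 1.5-4 s range (random spacing is an assumption)."""
    return random.uniform(low, high)


def run(jobs, scrape_one, checkpoint: Path, sleep=time.sleep) -> int:
    """Scrape every pending pair, skipping pairs already in the checkpoint."""
    done = load_done(checkpoint)
    scraped = 0
    for pair in jobs:
        if pair in done:
            continue  # finished in an earlier, possibly interrupted, run
        scrape_one(*pair)
        mark_done(checkpoint, done, pair)  # persist before moving on
        scraped += 1
        sleep(polite_delay())
    return scraped
```

Writing the checkpoint after every pair costs a little I/O, but an interruption then loses at most the pair that was in flight.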
Stage 2: Enrichment pipeline
Each scraped record passes through a multi-stage enrichment pipeline using open data sources:
| Stage | Source | Fields added |
|---|---|---|
| Name validation | probablepeople (ML) | Parsed name components |
| Institution canonicalization | ROR API | ror_id, ror_name, ror_country, org_type, lat/lon |
| Gender inference (country-aware) | WGND 2.0 (Raffo & Lax-Martinez, WIPO) → gender-guesser fallback | gender, gender_prob, gender_nobs, gender_source |
| Field classification | OpenAlex API | scientific_domain/field/subfield/topic |
| Bibliometrics | OpenAlex API | h_index, total_publications, total_citations, academic_age |
| Journal indexing | PubMed, Scopus, WoS, DOAJ, COPE lists | indexed_pubmed/scopus/wos/doaj/cope |
| Norwegian index | NPI database | npi_level, npi_discipline, npi_field |
| Journal metrics | OpenAlex API | journal_h_index, mean_citedness, oa_impact_quartile (OpenAlex citedness quartile per field — not JIF/CiteScore) |
| Funding sources | OpenAlex API | top_funder_1/2/3 |
| Board-level diversity | Computed | board_size, board_pct_female, country_count, country_hhi |
Role taxonomy
Editorial role labels on publisher pages are a mess: every publisher uses slightly
different wording, capitalisation, and ordering (Editor-in-Chief,
Chief Editor, EDITOR-IN-CHIEF, Associate Editors,
Associate Editorial Board Members, Editorial Board Member(s), and
so on). The scraper lowercases the raw string and runs it through a deterministic
substring-based normaliser to produce role_std.
The full mapping is exactly this, in priority order (first matching rule wins):
| Pattern (case-insensitive substring) | role_std | Rows in release |
|---|---|---|
| editor-in-chief · editor in chief · chief editor | editor_in_chief | 17,210 |
| associate | associate_editor | 180,781 |
| section | section_editor | 42,940 |
| review | reviewing_editor | 35,194 |
| deputy | deputy_editor | 3,808 |
| board | editorial_board_member | 470,515 |
| guest | guest_editor | 535 |
| editor · editora (generic fallback) | editor | 112,945 |
| anything else | other | 55,810 |
Priority ordering matters. A raw role like "Associate
Editor-in-Chief" matches both the editor-in-chief
and the associate rules. The first rule in the table
above wins, so that string is classified as
editor_in_chief. A raw role like
"Section Board Member" matches both section
and board; section is
higher in the order and wins, so it becomes
section_editor (that's why the section_editor bucket
has 42,940 rows despite the raw text almost always saying "Board Member").
The normaliser is deterministic and comes directly from
normalize_role() in
scripts/scrape_editorial_boards_2026.py (lines
559–579). If you re-run the scraper with a different rule set, the role_std column
will change but role (the raw, unstandardised
string) is always preserved so you can reclassify downstream.
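The first-match-wins behaviour described above can be reproduced with a short sketch. It is reconstructed from the mapping table only and is not a copy of normalize_role(); the names ROLE_RULES and normalize_role_sketch are hypothetical.

```python
# Rules in priority order, transcribed from the role taxonomy table.
ROLE_RULES: list[tuple[tuple[str, ...], str]] = [
    (("editor-in-chief", "editor in chief", "chief editor"), "editor_in_chief"),
    (("associate",), "associate_editor"),
    (("section",), "section_editor"),
    (("review",), "reviewing_editor"),
    (("deputy",), "deputy_editor"),
    (("board",), "editorial_board_member"),
    (("guest",), "guest_editor"),
    (("editor", "editora"), "editor"),  # generic fallback
]


def normalize_role_sketch(raw: str) -> str:
    """Lowercase the raw label and return the role_std of the first matching rule."""
    lowered = raw.lower()
    for patterns, role_std in ROLE_RULES:
        if any(pattern in lowered for pattern in patterns):
            return role_std
    return "other"  # no keyword matched


print(normalize_role_sketch("Associate Editor-in-Chief"))  # editor_in_chief
print(normalize_role_sketch("Section Board Member"))       # section_editor
```

Because the raw role column is preserved, a reordered or extended rule list of this shape can be applied downstream to reclassify rows without re-scraping.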
The other bucket (~6% of rows) is dominated by Topical
Advisory Panel Members, scientific/advisory committees, and language-specific
variants that don't share any of the eight keywords above. None of the high-stakes
quality journals use those labels, so their effect on cross-publisher comparisons
is small; the bucket is mostly a catch-all for regional and predatory outlets.
Board diversity indices
Each per-entity page (country, field, publisher, institution) shows a Board diversity panel with up to three per-editor indicators: country diversity, institution-type diversity, and seniority (academic age). All three are computed against the entity's unique editors (composite identity key), not its editorial positions, so a prolific editor who sits on many boards at one publisher contributes their institution type and academic age exactly once.
Normalized Shannon index
Country diversity and institution-type diversity both use the normalized Shannon index, also known as Pielou's evenness (Pielou, 1966). For a categorical distribution with proportions $p_1, p_2, \ldots, p_k$ across $k$ observed categories:
Shannon entropy: $H = -\sum_{i=1}^{k} p_i \ln p_i$ (nats)
Pielou evenness: $J = H / \ln k \in [0, 1]$
H is the classical Shannon entropy from information theory (Shannon, 1948); dividing by ln(k) normalises it against the maximum possible entropy for k categories (a uniform distribution), giving a comparable number in [0, 1] regardless of how many categories the entity spans. An entity with J = 1 distributes its editors perfectly evenly across every country or institution type it observes; J = 0 means all editors fall in the same bucket. A publisher with J = 0.6 spreads editors across its observed categories with moderate unevenness — far from uniform, but not concentrated in a single group.
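As a worked illustration with made-up numbers (not taken from the dataset), consider an entity whose unique editors fall into k = 3 categories with proportions 0.5, 0.25 and 0.25:

```latex
\begin{aligned}
H &= -\left(0.5\ln 0.5 + 0.25\ln 0.25 + 0.25\ln 0.25\right)
   \approx 0.347 + 0.347 + 0.347 \approx 1.040 \\
J &= \frac{H}{\ln 3} \approx \frac{1.040}{1.099} \approx 0.946
\end{aligned}
```

The mild imbalance (one category holding half the editors) lowers J only slightly below 1, which shows that values near 0.6 already correspond to a clearly skewed distribution.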
Country diversity — how it's calculated
The aggregator collects the ror_country of every
unique editor at this entity into a Counter,
computes H over that distribution, and divides by ln(k)
where k is the number of distinct countries present. Editors whose
institution is unresolved (empty ror_id) are
excluded from both the numerator and the denominator.
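The aggregation just described might look like the sketch below. The record layout, the editor_key field standing in for the composite identity key, and the convention of returning 0 when only one country is present (where ln(k) = 0) are assumptions made for illustration.

```python
import math
from collections import Counter


def pielou_evenness(counts: Counter) -> float:
    """Pielou's J = H / ln(k) over a Counter of category counts."""
    total = sum(counts.values())
    k = len(counts)
    if total == 0 or k < 2:
        return 0.0  # assumption: a single bucket is reported as J = 0
    h = -sum((n / total) * math.log(n / total) for n in counts.values())
    return h / math.log(k)


def country_diversity(editors: list[dict]) -> float:
    """Count each unique editor's country once, skipping unresolved institutions."""
    country_by_editor: dict[str, str] = {}
    for record in editors:
        if not record.get("ror_id"):
            continue  # unresolved institution: left out of numerator and denominator
        # "editor_key" is a hypothetical stand-in for the composite identity key
        country_by_editor.setdefault(record["editor_key"], record["ror_country"])
    return pielou_evenness(Counter(country_by_editor.values()))
```

Keying on the editor before counting is what keeps an editor with many board seats from being counted more than once.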
Use cases. Compare publishers to see which has the broadest international footprint (MDPI and Frontiers score higher than Elsevier on this indicator, reflecting their more geographically dispersed boards). Compare scientific fields to see which are more internationally concentrated. On country detail pages this panel is hidden because every editor there shares one country, so the distribution is trivially a single-country point mass.
Limitations. Dataset-wide publisher bias (the 48 publishers this site covers are not uniformly distributed across countries to begin with) propagates into the country Shannon — a publisher with a US-dominated journal catalog will score low regardless of its editorial policies. Use it as a relative indicator, not an absolute one.
Institution-type diversity — how it's calculated
Uses the org_type field that ROR assigns to
each institution. ROR also defines a residual other type (and, in schema v2, funder); the seven substantive values are:
education (universities, colleges, and schools),
healthcare (hospitals, clinics, and medical networks),
government (public research agencies, ministries, and labs),
facility (shared research infrastructure: telescopes, synchrotrons,
HPC centres), nonprofit (scientific societies, NGOs, foundations),
company (private-sector research, pharma, biotech), and
archive (libraries, museums, long-term archival institutions).
The aggregator counts each unique editor's org_type exactly once per entity and then computes Pielou's J over the resulting Counter. A value of J = 0 means every editor at this entity is classified into the same ROR type — typically "education", which accounts for about 82% of editors in the whole dataset. A value of J = 1 means the entity's editors are perfectly spread across the types that appear at all.
Why this matters scientometrically. Editorial boards drawn exclusively from universities present a specific epistemic stance on what counts as research: they systematically under-represent clinical practitioners (healthcare), public-sector researchers (government labs), applied R&D workers (company), and community-facing disciplines (nonprofit). A medical journal whose board is 98% education has very different biases from one whose board is 60% education + 35% healthcare, even if both have the same gender share and the same mean h-index. The institution-type diversity score makes that pattern legible at a glance.
How to use it. Rank a set of journals or publishers by this column to spot editorial boards that reach outside academia. Conversely, a very low institution-type Shannon at a publisher whose journals claim to serve practitioners is a red flag. Use it alongside the country diversity score: some journals are internationally broad but institutionally monochrome (globally sourced academics) while others are the reverse.
Limitations. ROR org_type is itself a simplification. A university hospital that publishes clinical trials sits at education in ROR even though the relevant institutional identity for editorial work is healthcare. A research institute federated with a university (e.g. Max Planck, CNRS) may be tagged as facility or government depending on ROR's classification. Edge cases push this indicator by a few percentage points in either direction — interpret differences of 0.05 or less as noise.
Seniority (academic age)
Academic age = current year (2026) − year of each editor's first OpenAlex-indexed publication. The panel reports the median age, the interquartile range (IQR = 75th percentile − 25th percentile), and the mean across unique editors at the entity.
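A minimal sketch of these three statistics follows, using only the Python standard library. The function name is hypothetical, and the inclusive quartile method is an assumption: the percentile interpolation actually used by the aggregator is not stated here, and other methods give slightly different IQRs on small boards.

```python
import statistics

REFERENCE_YEAR = 2026  # the "current year" stated in the definition


def seniority_summary(first_pub_years: list[int]) -> dict[str, float]:
    """Median, IQR and mean of academic age for one entity's unique editors."""
    ages = [REFERENCE_YEAR - year for year in first_pub_years]
    # Assumption: inclusive quartiles; needs at least two editors.
    q1, _q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")
    return {
        "median": statistics.median(ages),
        "iqr": q3 - q1,
        "mean": statistics.fmean(ages),
    }


# Five editors whose first indexed papers appeared in these years
print(seniority_summary([2021, 2016, 2011, 2006, 1996]))
```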
A low IQR with a low median signals a board dominated by one career stage — typically early-career editors clustered in a narrow age band. A high IQR means the board spans junior and senior editors together, which is what a healthy generational mix looks like. A high median with a large IQR indicates an entrenched senior board with some token early-career presence.
Limitations. OpenAlex indexing is biased toward STEM and toward English-language publishing. Authors who first published in a non-indexed venue appear younger than they are, and entire humanities subfields have lower academic-age figures across the board for reasons that have nothing to do with editorial policy.
References
- Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal 27, 379–423. Original definition of information entropy.
- Pielou, E. C. (1966). The measurement of diversity in different types of biological collections. Journal of Theoretical Biology 13, 131–144. Introduced the J = H / ln(k) normalisation used here. doi:10.1016/0022-5193(66)90013-0
- Jost, L. (2006). Entropy and diversity. Oikos 113, 363–375. Modern reference on why Shannon is the right functional form for diversity comparisons and how to interpret the normalised value. doi:10.1111/j.2006.0030-1299.14714.x
- Wikipedia. Shannon diversity index (accessible introduction with worked examples).
Limitations and disclaimers
Gender inference and coverage bias
Gender is algorithmically inferred from the editor's first name combined with their country of affiliation using WGND 2.0 (the World Gender Name Dictionary by Raffo & Lax-Martinez, WIPO 2021). The country signal disambiguates names whose gender flips across cultures: "Andrea", for example, is male in Italy but female in the United States. Lookup proceeds in three layers: a country-specific WGND match, then a global WGND match, then the legacy gender-guesser library as a tertiary fallback for names absent from WGND. The inferred value is not a self-reported gender identity. gender_prob carries the WGND weight in [0,1], gender_nobs the WGND sample size, and gender_source the provenance (wgnd_country / wgnd_global / gender_guesser / unknown).
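The lookup order implied by the gender_source values can be sketched as a simple cascade. Everything concrete here is illustrative: the two toy tables, their weights and sample sizes, the label strings, and the pluggable fallback callable are assumptions, and the real WGND 2.0 data and the gender-guesser call are not reproduced.

```python
from typing import Callable, Optional

# Toy tables with made-up weights and counts; only the "Andrea" contrast comes from the text.
WGND_COUNTRY = {
    ("andrea", "IT"): ("male", 0.98, 1200),
    ("andrea", "US"): ("female", 0.95, 900),
}
WGND_GLOBAL = {"maria": ("female", 0.99, 50000)}


def infer_gender(
    first_name: str,
    country: Optional[str],
    fallback: Callable[[str], Optional[str]] = lambda name: None,
) -> dict:
    """Try the country-specific table, then the global table, then a fallback."""
    key = first_name.strip().lower()
    if country and (key, country) in WGND_COUNTRY:
        gender, prob, nobs = WGND_COUNTRY[(key, country)]
        return {"gender": gender, "gender_prob": prob,
                "gender_nobs": nobs, "gender_source": "wgnd_country"}
    if key in WGND_GLOBAL:
        gender, prob, nobs = WGND_GLOBAL[key]
        return {"gender": gender, "gender_prob": prob,
                "gender_nobs": nobs, "gender_source": "wgnd_global"}
    guessed = fallback(key)  # stand-in for the gender-guesser layer
    if guessed is not None:
        return {"gender": guessed, "gender_prob": None,
                "gender_nobs": None, "gender_source": "gender_guesser"}
    return {"gender": "unknown", "gender_prob": None,
            "gender_nobs": None, "gender_source": "unknown"}
```

Recording the layer that produced each call is what lets downstream users filter or weight inferences by provenance.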
Crucially, coverage is not uniform across countries. WGND 2.0 covers ~3.5M unique first names across 195 countries, materially improving coverage on Slavic, Arabic, South Asian, and African names compared to the older Latin-script-centric gender-guesser dictionary. The largest gain is on East Asian editorial-board members: classification of editors affiliated to Chinese institutions rose from 8.0% under the v2.6 gender-guesser pipeline to 44.6% under the v2.7 WGND pipeline, and South Korea from 11.6% to 40.2% (after applying the gender_prob ≥ 0.75 confidence floor described below). Transliterated CJK-script names from Taiwan (19.2%) and Hong Kong (59.4%) remain the principal residual gap. Every per-entity page on this site reports pct_gender_classified alongside the female share, and entities below 60% coverage display a prominent warning band: the reported female share is then a fraction of the classifiable minority, not of the whole board.
Concretely in the current release:
- Italy, Germany, UK, US, France: 80–95% classified — the female share is trustworthy
- Japan: around 80% classified — reasonably trustworthy
- China: around 34% classified — two-thirds of Chinese editors' names are unclassified and the displayed female share only covers the 34%
- South Korea, Taiwan, India: 16–41% classified — treat per-country female shares as indicative at best
When comparing countries, compare the pct_gender_classified values first; gaps in coverage often explain away apparent gender imbalances. The global pct_female headline (currently 33.0% of unique editors with a resolved gender; 82.0% of unique editors are gender-classified) is computed the same way and inherits the same bias — it is a lower bound on the share in well-covered regions and a less meaningful number in poorly-covered ones.
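To make the relationship between the two percentages concrete, here is a small sketch of how a female share over classified editors and a coverage figure could be derived together. The label strings "female", "male" and "unknown" and the function name are assumptions; the 60% warning threshold is the one quoted above.

```python
from typing import Optional


def gender_coverage(genders: list[str], warn_below: float = 60.0) -> dict:
    """Female share among classified editors, plus the coverage it rests on."""
    classified = [g for g in genders if g in ("female", "male")]
    n_total, n_classified = len(genders), len(classified)
    pct_classified = 100.0 * n_classified / n_total if n_total else 0.0
    pct_female: Optional[float] = (
        100.0 * classified.count("female") / n_classified if n_classified else None
    )
    return {
        "pct_gender_classified": pct_classified,
        "pct_female": pct_female,  # share of the classified subset only
        "low_coverage_warning": pct_classified < warn_below,
    }


# A board of ten where only four names could be classified
print(gender_coverage(["female"] * 2 + ["male"] * 2 + ["unknown"] * 6))
```

In this toy board the female share reads 50%, yet it describes only four of ten editors, which is exactly the situation the warning band is meant to flag.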
Confidence floor: gender_prob ≥ 0.75
v2.7 applies a confidence floor of gender_prob ≥ 0.75 at WGND lookup time. Matches with weights below the floor are demoted to gender = 'unknown', gender_source = 'unknown', while the raw gender_prob and gender_nobs stay populated for transparency. The threshold was set empirically from a 100-row manual validation sample drawn from v2.7 (50 records from gender_source = wgnd_country and 50 from wgnd_global): pre-threshold precision was 97/100 (97.0%), with 3 mis-classifications all on the wgnd_global layer at probabilities 0.50, 0.52 and 0.71. Applying the gender_prob ≥ 0.75 floor demotes 5 of the 100 rows to gender = 'unknown': those 3 wrong calls plus 2 correct calls that also fell below the floor. The resolved post-threshold subset is therefore 95/95 = 100.0% precise on this sample, at a coverage cost of ~5 pp. Researchers who want the raw v2.7 inferences (no floor) can recover them by filtering the master parquet on the unmodified gender_prob column or by calling the lookup with wgnd.annotate(..., min_prob=0.0).
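The demotion rule can be illustrated with a short sketch that operates on one record at a time. The function name and the record layout are hypothetical, and this is not the signature of wgnd.annotate; it only mirrors the behaviour described in the paragraph above.

```python
def apply_confidence_floor(record: dict, min_prob: float = 0.75) -> dict:
    """Demote low-confidence WGND matches while keeping the raw weight."""
    out = dict(record)  # leave the caller's record untouched
    source = str(out.get("gender_source") or "")
    prob = out.get("gender_prob")
    # Only WGND layers carry a weight, so only they are subject to the floor.
    if source.startswith("wgnd") and prob is not None and prob < min_prob:
        out["gender"] = "unknown"
        out["gender_source"] = "unknown"
        # gender_prob and gender_nobs are intentionally not cleared
    return out


low = {"gender": "female", "gender_prob": 0.71, "gender_nobs": 40,
       "gender_source": "wgnd_global"}
print(apply_confidence_floor(low))                # demoted, weight retained
print(apply_confidence_floor(low, min_prob=0.0))  # raw inference kept
```

Keeping the weight on demoted rows is what lets a user re-threshold at a different value without re-running the lookup.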
Indexing as quality proxy
Indexing columns (indexed_pubmed, indexed_scopus, etc.) reflect database membership status at scraping time. Absence from an index does not necessarily indicate low quality — new, regional, or specialized journals may not yet be indexed.
Affiliation currency
Affiliations were scraped from publisher websites and may not reflect editors' current institutional appointments.
Data sources
This project relies on several open data sources and APIs. We are grateful to the teams behind each of these resources:
| Source | Used for | License |
|---|---|---|
| OpenAlex | Bibliometrics (h-index, citations), journal classification, journal metrics, funding sources | CC0 |
| ROR (Research Organization Registry) | Institution canonicalization, geolocation (country, city, coordinates) | CC0 |
| WGND 2.0 (Raffo & Lax-Martinez, WIPO) | Country-aware gender inference from first names (~3.5M unique names, 195 countries) | CC0 |
| gender-guesser | Tertiary gender-inference fallback for names absent from WGND | GPL-3.0 |
| Scopus Source List | Journal indexing status (indexed_scopus) | Publicly available list |
| Web of Science Master Journal List | Journal indexing status (indexed_wos) | Publicly available list |
| PubMed / NLM Catalog | Journal indexing status (indexed_pubmed) | Public domain |
| DOAJ (Directory of Open Access Journals) | Open access status, journal indexing | CC-BY-SA |
| COPE (Committee on Publication Ethics) | Publisher ethics membership | Public list |
| Norwegian Register for Scientific Journals (NPI/NSD) | Norwegian Publishing Indicator level and classification | Open data |
| ORCID | Editor unique identifiers | CC0 (public data) |
| probablepeople | Name parsing and validation | MIT |
Acknowledgements
Open Editors Plus builds upon and is inspired by the original Open Editors project by Nishikawa-Pacher, Heck, and Schoch. We are deeply grateful to the teams behind OpenAlex, ROR, DOAJ, ORCID, and all the other open data initiatives that make this work possible. Open science infrastructure is a public good, and this project exists because of their commitment to open data.
Ethics statement
This dataset contains only publicly available information from journal editorial board pages. No private or restricted data sources were used. The scraping respects publisher rate limits and robots.txt directives. Race/ethnicity inference data was computed for internal analysis only and is not included in the public release. Gender inference is provided with full methodology transparency and confidence scores to enable responsible use.