Methodology

Overview

Open Editors Plus is a two-stage pipeline that (1) scrapes editorial board listings from publisher websites, then (2) enriches each record with institutional, demographic, and bibliometric metadata from open APIs.

Stage 1: Editorial board scraping

The scraper targets 48 academic publishers and extracts editor names, roles, affiliations, and ORCIDs from publicly available journal editorial board pages. Five complementary scraping strategies are employed depending on publisher infrastructure:

Static HTML

Requests + BeautifulSoup for publishers with server-rendered pages (the majority of publishers).

Dynamic rendering

Playwright/Chromium for JavaScript-heavy pages that require browser execution.

REST APIs

Direct API access where structured endpoints are available (IEEE, ORCID v3).
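For the ORCID case, a request against the public v3.0 API might be built roughly as follows. This is a minimal sketch, not the project's actual client: the function names build_orcid_request and fetch_orcid_record are hypothetical, and the choice of the /record endpoint is an assumption about which part of the ORCID record the scraper needs.

```python
import json
import urllib.request

# Base URL of the ORCID public API, version 3.0.
ORCID_API_BASE = "https://pub.orcid.org/v3.0"


def build_orcid_request(orcid_id: str) -> urllib.request.Request:
    """Build a GET request for a public ORCID record (hypothetical helper)."""
    url = f"{ORCID_API_BASE}/{orcid_id}/record"
    # Ask for JSON explicitly instead of relying on the server default.
    return urllib.request.Request(url, headers={"Accept": "application/json"})


def fetch_orcid_record(orcid_id: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body (performs network access)."""
    with urllib.request.urlopen(build_orcid_request(orcid_id), timeout=timeout) as resp:
        return json.load(resp)
```

Keeping request construction separate from the network call makes the URL and headers easy to inspect without contacting the API.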

Stealth browser (Crawl4AI)

Headless Chromium with anti-detection for Cloudflare-protected sites (Elsevier, AIP).

LLM fallback (Ollama)

Local Qwen 2.5 32B model for heterogeneous layouts where CSS/XPath parsing fails. Used as a last resort.
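One way to picture how the strategies combine is a fallback chain that tries cheaper strategies first and reaches the LLM only when the others return nothing. The sketch below is an assumption about the control flow, not the project's dispatcher: scrape_with_fallback and the strategy callables are hypothetical names, and in practice the strategy order would be chosen per publisher.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Editor = Dict[str, str]
Strategy = Callable[[str], List[Editor]]


def scrape_with_fallback(
    url: str, strategies: Iterable[Tuple[str, Strategy]]
) -> Tuple[Optional[str], List[Editor]]:
    """Try each (name, strategy) pair in order; return the first non-empty result."""
    for name, strategy in strategies:
        try:
            editors = strategy(url)
        except Exception:
            # Sketch only: treat any failure as "this strategy does not apply"
            # and move on. Real code would log the error and catch narrower types.
            continue
        if editors:
            return name, editors
    # No strategy produced any editors.
    return None, []
```

Placing the LLM-based strategy last in the list keeps it a last resort, as described above.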

Data quality measures

  • Encoding repair: Three-layer mojibake correction (ftfy, algorithmic reversal, table-based replacement) for names in non-Latin scripts.
  • Affiliation cleaning: A universal cleaning pipeline strips roles, credentials, dates, and other non-affiliation text from affiliations across all publishers.
  • Checkpoint resumption: Completed (publisher, journal) pairs are tracked, enabling interruption-safe runs.
  • Rate limiting: Delays of 1.5 to 4 seconds between requests limit the load on publisher infrastructure.
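Three of the measures above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: repair_mojibake shows only the algorithmic-reversal layer and assumes UTF-8 text that was mis-decoded as Windows-1252; the JSON checkpoint file format is hypothetical; and drawing the delay uniformly at random from the 1.5 to 4 second range is an assumption about how the delay is chosen.

```python
import json
import random
import time
from pathlib import Path


def repair_mojibake(text: str) -> str:
    """Reverse UTF-8 text mis-decoded as Windows-1252; otherwise return it unchanged."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not produced by that mis-decoding, so leave it alone.
        return text


class Checkpoint:
    """Tracks completed (publisher, journal) pairs in a JSON file (assumed format)."""

    def __init__(self, path):
        self.path = Path(path)
        self.done = set()
        if self.path.exists():
            pairs = json.loads(self.path.read_text(encoding="utf-8"))
            self.done = {tuple(pair) for pair in pairs}

    def is_done(self, publisher: str, journal: str) -> bool:
        return (publisher, journal) in self.done

    def mark_done(self, publisher: str, journal: str) -> None:
        self.done.add((publisher, journal))
        # Rewrite the file after every pair so an interrupted run loses nothing.
        self.path.write_text(json.dumps(sorted(self.done)), encoding="utf-8")


def polite_delay(min_s: float = 1.5, max_s: float = 4.0, sleep=time.sleep) -> float:
    """Sleep for a random duration in [min_s, max_s] and return the duration."""
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay
```

In a scraping loop, a journal would be skipped when is_done returns True, and mark_done followed by polite_delay would be called after each successful fetch.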

Stage 2: Enrichment pipeline

Each scraped record passes through a multi-stage enrichment pipeline using open data sources:

| Stage | Source | Fields added |
|---|---|---|
| Name validation | probablepeople (ML) | Parsed name components |
| Gender inference | gender-guesser | gender, gender_prob |
| Institution canonicalization | ROR API | ror_id, ror_name, ror_country, org_type, lat/lon |
| Field classification | OpenAlex API | scientific_domain/field/subfield/topic |
| Bibliometrics | OpenAlex API | h_index, total_publications, total_citations, academic_age |
| Journal indexing | PubMed, Scopus, WoS, DOAJ, COPE lists | indexed_pubmed/scopus/wos/doaj/cope |
| Norwegian index | NPI database | npi_level, npi_discipline, npi_field |
| Journal metrics | OpenAlex API | journal_h_index, mean_citedness, impact_quartile |
| Funding sources | OpenAlex API | top_funder_1/2/3 |
| Board-level diversity | Computed | board_size, board_pct_female, country_count, country_hhi |
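The computed board-level fields can be illustrated with a short sketch. The definitions here are assumptions, since the source does not spell them out: board_pct_female is taken as a percentage of editors with an inferred gender of "female" or "male" (those labels are also assumed), and country_hhi is taken as a Herfindahl-Hirschman index on a 0 to 1 scale, that is, the sum of squared country shares. The function name board_diversity is hypothetical.

```python
from collections import Counter


def board_diversity(editors):
    """Compute board-level diversity fields from a list of editor dicts."""
    # Only editors with a binary inferred gender enter the denominator (assumption).
    genders = [e.get("gender") for e in editors if e.get("gender") in ("female", "male")]
    pct_female = 100 * genders.count("female") / len(genders) if genders else None

    # Count editors per country, ignoring records without a resolved country.
    countries = Counter(e["ror_country"] for e in editors if e.get("ror_country"))
    total = sum(countries.values())
    # HHI: sum of squared shares; 1.0 means every editor is in one country.
    hhi = sum((n / total) ** 2 for n in countries.values()) if total else None

    return {
        "board_size": len(editors),
        "board_pct_female": pct_female,
        "country_count": len(countries),
        "country_hhi": hhi,
    }
```

Under these definitions, a board split evenly between two countries has a country_hhi of 0.5, and lower values indicate a more geographically dispersed board.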

Limitations and disclaimers

Gender inference

Gender was inferred algorithmically from first names using the gender-guesser Python library; it is not self-reported gender identity. The gender_prob field indicates the confidence of the inference. Researchers should interpret these values with caution and consider filtering by a confidence threshold.
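Filtering by confidence could look like the sketch below. It assumes that gender_prob is a number between 0 and 1 and that records are available as dictionaries; the 0.9 threshold is purely illustrative, not a recommended value, and the function name filter_by_gender_confidence is hypothetical.

```python
def filter_by_gender_confidence(records, threshold=0.9):
    """Keep only records whose gender_prob is present and at least `threshold`."""
    kept = []
    for record in records:
        prob = record.get("gender_prob")
        # Records without a confidence value are dropped rather than guessed at.
        if prob is not None and float(prob) >= threshold:
            kept.append(record)
    return kept
```

Reporting how many records fall below the threshold alongside any gender statistics helps readers judge how much the filter changes the sample.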

Indexing as quality proxy

Indexing columns (indexed_pubmed, indexed_scopus, etc.) reflect database membership status at scraping time. Absence from an index does not necessarily indicate low quality: new, regional, or specialized journals may not yet be indexed.

Affiliation currency

Affiliations were scraped from publisher websites and may not reflect editors' current institutional appointments.

Data sources

This project relies on several open data sources and APIs. We are grateful to the teams behind each of these resources:

| Source | Used for | License |
|---|---|---|
| OpenAlex | Bibliometrics (h-index, citations), journal classification, journal metrics, funding sources | CC0 |
| ROR (Research Organization Registry) | Institution canonicalization, geolocation (country, city, coordinates) | CC0 |
| gender-guesser | Gender inference from first names | GPL-3.0 |
| Scopus Source List | Journal indexing status (indexed_scopus) | Publicly available list |
| Web of Science Master Journal List | Journal indexing status (indexed_wos) | Publicly available list |
| PubMed / NLM Catalog | Journal indexing status (indexed_pubmed) | Public domain |
| DOAJ (Directory of Open Access Journals) | Open access status, journal indexing | CC-BY-SA |
| COPE (Committee on Publication Ethics) | Publisher ethics membership | Public list |
| Norwegian Register for Scientific Journals (NPI/NSD) | Norwegian Publishing Indicator level and classification | Open data |
| ORCID | Editor unique identifiers | CC0 (public data) |
| probablepeople | Name parsing and validation | MIT |

Acknowledgements

Open Editors Plus builds upon and is inspired by the original Open Editors project by Nishikawa-Pacher, Heck, and Schoch. We are deeply grateful to the teams behind OpenAlex, ROR, DOAJ, ORCID, and all the other open data initiatives that make this work possible. Open science infrastructure is a public good, and this project exists because of their commitment to open data.

Ethics statement

This dataset contains only publicly available information from journal editorial board pages. No private or restricted data sources were used. The scraping respects publisher rate limits and robots.txt directives. Race/ethnicity inference data was computed for internal analysis only and is not included in the public release. Gender inference is provided with full methodology transparency and confidence scores to enable responsible use.