Methodology
Overview
Open Editors Plus is a two-stage pipeline that (1) scrapes editorial board listings from publisher websites, then (2) enriches each record with institutional, demographic, and bibliometric metadata from open APIs.
Stage 1: Editorial board scraping
The scraper targets 48 academic publishers and extracts editor names, roles, affiliations, and ORCIDs from publicly available journal editorial board pages. Five complementary scraping strategies are employed depending on publisher infrastructure:
Static HTML
Requests + BeautifulSoup for publishers with server-rendered pages (the majority of publishers).
Dynamic rendering
Playwright/Chromium for JavaScript-heavy pages that require browser execution.
REST APIs
Direct API access for sources with structured endpoints (IEEE, ORCID v3).
Stealth browser (Crawl4AI)
Headless Chromium with anti-detection for Cloudflare-protected sites (Elsevier, AIP).
LLM fallback (Ollama)
Local Qwen 2.5 32B model for heterogeneous layouts where CSS/XPath parsing fails. Used as a last resort.
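The "last resort" ordering can be pictured as a simple fallback chain. The sketch below is only an illustration under assumptions: the function names, the record shape, and the idea that strategies are tried in sequence (rather than assigned per publisher) are hypothetical and not taken from the project's code.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical contract: a strategy takes a URL and returns editor records,
# or None / an empty list when it cannot parse the page.
Strategy = Callable[[str], Optional[List[Dict[str, str]]]]


def scrape_with_fallback(
    url: str, strategies: List[Tuple[str, Strategy]]
) -> Tuple[str, List[Dict[str, str]]]:
    """Try each strategy in order and return the first non-empty result."""
    for name, strategy in strategies:
        try:
            records = strategy(url)
        except Exception:
            # One failing strategy should not abort the run; try the next one.
            continue
        if records:
            return name, records
    return "none", []


# Stand-in strategies for illustration only (no network access).
def static_html(url: str) -> Optional[List[Dict[str, str]]]:
    return None  # pretend CSS/XPath parsing found nothing


def llm_fallback(url: str) -> Optional[List[Dict[str, str]]]:
    return [{"name": "Jane Doe", "role": "Editor-in-Chief"}]


used, editors = scrape_with_fallback(
    "https://example.org/journal/editorial-board",
    [("static_html", static_html), ("llm_fallback", llm_fallback)],
)
print(used, editors)
```

Placing the most expensive strategy last keeps the LLM out of the path for pages that cheaper parsers already handle.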
Data quality measures
- Encoding repair: Three-layer mojibake correction (ftfy, algorithmic reversal, table-based replacement) for names in non-Latin scripts.
- Affiliation cleaning: A single cleaning pipeline, applied to every publisher, strips roles, credentials, dates, and other non-affiliation text from affiliation strings.
- Checkpoint resumption: Completed (publisher, journal) pairs are tracked, enabling interruption-safe runs.
- Rate limiting: Delays of 1.5 to 4 seconds between requests to limit load on publisher servers.
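Three of the measures above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: `reverse_mojibake` shows one common form of algorithmic reversal (UTF-8 bytes misread as Windows-1252), and the `Checkpoint` class, its JSON file format, and `polite_delay` are hypothetical names chosen for the example.

```python
import json
import random
import tempfile
import time
from pathlib import Path


def reverse_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of mojibake; leave the text unchanged


class Checkpoint:
    """Track completed (publisher, journal) pairs in a JSON file (assumed format)."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)
        self.done = set()
        if self.path.exists():
            pairs = json.loads(self.path.read_text(encoding="utf-8"))
            self.done = {tuple(pair) for pair in pairs}

    def is_done(self, publisher: str, journal: str) -> bool:
        return (publisher, journal) in self.done

    def mark_done(self, publisher: str, journal: str) -> None:
        self.done.add((publisher, journal))
        # Rewrite the file after every pair so an interruption loses no completed work.
        self.path.write_text(json.dumps(sorted(self.done)), encoding="utf-8")


def polite_delay(low: float = 1.5, high: float = 4.0, sleep=time.sleep) -> float:
    """Sleep for a random duration between low and high seconds."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay


# Build a garbled sample by simulating the wrong decoding, then repair it.
garbled = "Иван".encode("utf-8").decode("cp1252")
repaired = reverse_mojibake(garbled)

# Simulate an interrupted run followed by a restart.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "checkpoint.json"
    Checkpoint(path).mark_done("Publisher A", "Journal X")
    resumed = Checkpoint(path)
    skipped = resumed.is_done("Publisher A", "Journal X")
    pending = resumed.is_done("Publisher A", "Journal Y")
```

A real run would call `polite_delay()` before each request and skip any pair for which `is_done` returns `True`.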
Stage 2: Enrichment pipeline
Each scraped record passes through a multi-stage enrichment pipeline using open data sources:
| Stage | Source | Fields added |
|---|---|---|
| Name validation | probablepeople (ML) | Parsed name components |
| Gender inference | gender-guesser | gender, gender_prob |
| Institution canonicalization | ROR API | ror_id, ror_name, ror_country, org_type, lat/lon |
| Field classification | OpenAlex API | scientific_domain/field/subfield/topic |
| Bibliometrics | OpenAlex API | h_index, total_publications, total_citations, academic_age |
| Journal indexing | PubMed, Scopus, WoS, DOAJ, COPE lists | indexed_pubmed/scopus/wos/doaj/cope |
| Norwegian index | NPI database | npi_level, npi_discipline, npi_field |
| Journal metrics | OpenAlex API | journal_h_index, mean_citedness, impact_quartile |
| Funding sources | OpenAlex API | top_funder_1/2/3 |
| Board-level diversity | Computed | board_size, board_pct_female, country_count, country_hhi |
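The board-level fields in the last row can be derived from the per-editor fields. The sketch below rests on assumptions the table does not state: gender values are the strings "female" and "male", editors with unknown gender or country are excluded from the respective denominators, and `country_hhi` is a Herfindahl-Hirschman index on a 0 to 1 scale (the sum of squared country shares). The function name is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def board_diversity(editors: List[dict]) -> Dict[str, Optional[float]]:
    """Compute board-level diversity fields from per-editor records."""
    # Only editors with a definite inferred gender count toward the share.
    genders = [e.get("gender") for e in editors if e.get("gender") in ("female", "male")]
    # Only editors with a resolved country count toward the concentration index.
    countries = Counter(e["ror_country"] for e in editors if e.get("ror_country"))
    located = sum(countries.values())
    return {
        "board_size": len(editors),
        "board_pct_female": 100.0 * genders.count("female") / len(genders) if genders else None,
        "country_count": len(countries),
        # HHI: 1.0 means a single country; values near 0 mean many evenly spread countries.
        "country_hhi": sum((n / located) ** 2 for n in countries.values()) if located else None,
    }


board = [
    {"gender": "female", "ror_country": "US"},
    {"gender": "male", "ror_country": "US"},
    {"gender": "female", "ror_country": "DE"},
    {"gender": "male", "ror_country": "NO"},
]
stats = board_diversity(board)
print(stats)
```

For this sample board the country shares are 0.5, 0.25, and 0.25, so the index is 0.25 + 0.0625 + 0.0625 = 0.375.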
Limitations and disclaimers
Gender inference
Gender was inferred algorithmically from first names with the gender-guesser Python library; it is not self-reported gender identity. The gender_prob field indicates the confidence of the inference. Researchers should interpret these values with caution and consider filtering by a confidence threshold.
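Filtering by confidence could look like the sketch below. It assumes that gender_prob is a number between 0 and 1 and that records are plain dictionaries; the 0.9 cutoff and the function name are arbitrary choices for illustration, not recommendations from the dataset authors.

```python
from typing import Dict, List


def filter_by_confidence(records: List[Dict], threshold: float = 0.9) -> List[Dict]:
    """Keep only records whose gender inference meets the confidence threshold."""
    return [
        r for r in records
        if r.get("gender_prob") is not None and r["gender_prob"] >= threshold
    ]


sample = [
    {"name": "A", "gender": "female", "gender_prob": 0.98},
    {"name": "B", "gender": "male", "gender_prob": 0.62},
    {"name": "C", "gender": None, "gender_prob": None},
]
confident = filter_by_confidence(sample)
print([r["name"] for r in confident])
```

Reporting how many records were dropped by the threshold helps readers judge how representative the filtered sample is.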
Indexing as quality proxy
Indexing columns (indexed_pubmed, indexed_scopus, etc.) reflect database membership status at scraping time. Absence from an index does not necessarily indicate low quality: new, regional, or specialized journals may not yet be indexed.
Affiliation currency
Affiliations were scraped from publisher websites and may not reflect editors' current institutional appointments.
Data sources
This project relies on several open data sources and APIs. We are grateful to the teams behind each of these resources:
| Source | Used for | License |
|---|---|---|
| OpenAlex | Bibliometrics (h-index, citations), journal classification, journal metrics, funding sources | CC0 |
| ROR (Research Organization Registry) | Institution canonicalization, geolocation (country, city, coordinates) | CC0 |
| gender-guesser | Gender inference from first names | GPL-3.0 |
| Scopus Source List | Journal indexing status (indexed_scopus) | Publicly available list |
| Web of Science Master Journal List | Journal indexing status (indexed_wos) | Publicly available list |
| PubMed / NLM Catalog | Journal indexing status (indexed_pubmed) | Public domain |
| DOAJ (Directory of Open Access Journals) | Open access status, journal indexing | CC-BY-SA |
| COPE (Committee on Publication Ethics) | Publisher ethics membership | Public list |
| Norwegian Register for Scientific Journals (NPI/NSD) | Norwegian Publishing Indicator level and classification | Open data |
| ORCID | Editor unique identifiers | CC0 (public data) |
| probablepeople | Name parsing and validation | MIT |
Acknowledgements
Open Editors Plus builds upon and is inspired by the original Open Editors project by Nishikawa-Pacher, Heck, and Schoch. We are deeply grateful to the teams behind OpenAlex, ROR, DOAJ, ORCID, and all the other open data initiatives that make this work possible. Open science infrastructure is a public good, and this project exists because of their commitment to open data.
Ethics statement
This dataset contains only publicly available information from journal editorial board pages. No private or restricted data sources were used. The scraping respects publisher rate limits and robots.txt directives. Race/ethnicity inference data was computed for internal analysis only and is not included in the public release. Gender inference is provided with full methodology transparency and confidence scores to enable responsible use.