S-index Calculation
This page describes how S-index scores were calculated for the auto-generated researcher profiles shown on Scholar Data.
These S-index scores were computed automatically for demo and validation purposes only. They are not intended to represent a researcher's definitive S-index. For an accurate S-index, researchers should create a profile, claim their datasets, and let Scholar Data compute their score from a verified, curated dataset list. The S-index for researcher-created profiles is computed the same way.
Building Author Profiles
A DuckDB table was constructed by expanding the dataset corpus by author, producing a table of 216M+ rows, one row per dataset per author. Authors were then regrouped across datasets using the following strategy:
- ORCID or other persistent identifier: used where available for unambiguous author matching
- Name and affiliation set: used for authors without a persistent identifier
Authors listed as organizations in the DataCite creators field were excluded. After regrouping and deduplication, 1,032,546 unique authors were identified.
| Grouping Method | Authors | Share |
|---|---|---|
| Identifier (e.g. ORCID) | 346,524 | 33.6% |
| Name / affiliation set | 686,022 | 66.4% |
| Total | 1,032,546 | 100% |
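The regrouping strategy described above can be illustrated with a minimal Python sketch. It is not the DuckDB pipeline used for Scholar Data. The row fields (`dataset_id`, `name`, `identifier`, `affiliations`, `name_type`), the name normalization, and the exact-match rule on the affiliation set are all assumptions made for illustration, and the identifiers in the sample rows are placeholders.

```python
from collections import defaultdict


def author_key(row):
    """Return a grouping key for one (dataset, author) row."""
    if row.get("identifier"):
        # A persistent identifier (e.g. an ORCID) matches authors unambiguously.
        return ("id", row["identifier"])
    # Fallback: normalized name plus the set of affiliations.
    # The normalization used here (lowercase, collapsed whitespace) is an assumption.
    name = " ".join(row["name"].lower().split())
    affiliations = frozenset(a.strip().lower() for a in (row.get("affiliations") or []))
    return ("name", name, affiliations)


def group_authors(rows):
    """Map each author key to the set of dataset IDs attributed to that author."""
    profiles = defaultdict(set)
    for row in rows:
        # Skip creators flagged as organizations (hypothetical `name_type` field).
        if row.get("name_type") == "Organizational":
            continue
        profiles[author_key(row)].add(row["dataset_id"])
    return dict(profiles)


# Placeholder rows, one per dataset per author.
example_rows = [
    {"dataset_id": "d1", "name": "Ada Lovelace", "identifier": "0000-0000-0000-0001", "affiliations": ["Univ A"]},
    {"dataset_id": "d2", "name": "A. Lovelace", "identifier": "0000-0000-0000-0001", "affiliations": []},
    {"dataset_id": "d3", "name": "Grace  Hopper", "identifier": None, "affiliations": ["Institute B"]},
    {"dataset_id": "d4", "name": "grace hopper", "identifier": None, "affiliations": ["institute b"]},
    {"dataset_id": "d5", "name": "Example Lab", "identifier": None, "affiliations": [], "name_type": "Organizational"},
]

profiles = group_authors(example_rows)
for key, datasets in profiles.items():
    print(key[0], sorted(datasets))
```

In this sketch the two identifier rows merge despite different name spellings, the two name-only rows merge after normalization, and the organizational creator is dropped. Exact matching on name and affiliation set is the simplest possible rule; it shows why this path can both conflate and split researchers.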
Each author was assigned a primary research field based on the research field of the majority of their datasets.
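As a rough sketch of this assignment, the function below picks the most frequent research field among an author's datasets. The function name, the input shape, and the tie-breaking rule (the field encountered first wins) are assumptions; the source does not specify how ties or authors without a clear majority are handled.

```python
from collections import Counter


def primary_field(dataset_fields):
    """Return the most common research field, or None if there are no datasets."""
    if not dataset_fields:
        return None
    counts = Counter(dataset_fields)
    # most_common(1) returns the highest count; among ties, the field seen first.
    field, _count = counts.most_common(1)[0]
    return field


# Hypothetical field labels for one author's datasets.
print(primary_field(["Biology", "Physics", "Biology"]))
```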
S-index Calculation
Each author's S-index was calculated as the sum of the D-index scores of all datasets attributed to them, following the S-index formula. The auto-generated profiles and their S-index scores are browsable on Scholar Data's author search page.
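The summation described above can be sketched in a few lines of Python. The D-index values below are made-up placeholders, and the function name is hypothetical; how each D-index is derived is outside the scope of this sketch.

```python
import math


def s_index(d_index_scores):
    """Sum the D-index scores of all datasets attributed to one author."""
    # math.fsum limits floating-point rounding error when summing many scores.
    return math.fsum(d_index_scores)


# Placeholder D-index scores for three datasets of one author.
example_scores = [1.5, 0.25, 2.0]
print(s_index(example_scores))
```

An author with no attributed datasets gets a score of 0.0 under this sketch.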
Limitations of Auto-Generated Scores
Because author profiles were assembled automatically from metadata, without researcher input, the scores shown carry inherent limitations:
- Author disambiguation is imperfect. Name/affiliation matching (66.4% of profiles) can conflate different researchers with similar names or split a single researcher's work across multiple profiles.
- Dataset attribution may be incomplete. Datasets not registered in DataCite or EMDB, or with missing author metadata, are not reflected in auto-generated scores.
- Scores are a point-in-time snapshot. They reflect citations, mentions, and FAIR scores as of January 2026 and will not update automatically.
Researchers who create a profile and manually claim their datasets will get a more accurate and up-to-date S-index that reflects their actual sharing footprint.