S-index Calculation

This page describes how S-index scores were calculated for the auto-generated researcher profiles shown on Scholar Data.

These S-index scores were computed automatically for demo and validation purposes only. They are not intended to represent a researcher's definitive S-index. For an accurate S-index, researchers should create a profile, claim their datasets, and let Scholar Data compute their score from a verified, curated dataset list. The S-index for researcher-created profiles is computed by the same process.

Building Author Profiles

A DuckDB table was constructed by expanding the dataset corpus by author, producing more than 216 million rows, one row per dataset per author. Authors were then regrouped across datasets using the following strategy:

  • ORCID or other persistent identifier: used where available for unambiguous author matching
  • Name and affiliation set: used for authors without a persistent identifier

Authors listed as organizations in the DataCite creators field were excluded. After regrouping and deduplication, 1,032,546 unique authors were identified.
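The expansion and regrouping steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline that produced the profiles: the record layout, the field names (`creators`, `orcid`, `affiliations`, `name_type`), the helper names, and the lowercase/whitespace normalization are all hypothetical. The `"Organizational"` value mirrors the DataCite `nameType` vocabulary, but how the actual pipeline detected organizations is not documented here.

```python
from collections import defaultdict


def expand_by_author(datasets):
    """Yield one row per dataset per author (hypothetical record layout)."""
    for ds in datasets:
        for creator in ds.get("creators", []):
            yield {"dataset_id": ds["id"], **creator}


def author_key(author):
    """Build a grouping key: persistent identifier first, else name + affiliation set."""
    pid = author.get("orcid")
    if pid:
        return ("id", pid.strip().lower())
    # Assumed normalization: lowercase and collapse whitespace.
    name = " ".join(author["name"].lower().split())
    affiliations = frozenset(
        a.strip().lower() for a in author.get("affiliations", [])
    )
    return ("name_affil", name, affiliations)


def group_authors(rows):
    """Regroup expanded rows into author -> set of dataset ids, skipping organizations."""
    groups = defaultdict(set)
    for row in rows:
        if row.get("name_type") == "Organizational":
            continue  # organizations listed as creators are excluded
        groups[author_key(row)].add(row["dataset_id"])
    return dict(groups)
```

Keying the fallback on the whole affiliation set means the same name with a different set of affiliations lands in a separate group, which is one way a single researcher's work can end up split across profiles.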

Grouping Method            Authors      Share
Identifier (e.g. ORCID)    346,524      33.6%
Name / affiliation set     686,022      66.4%
Total                      1,032,546    100%

Each author was assigned a primary research field based on the research field of the majority of their datasets.
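As a rough sketch of this assignment, the function below picks the field that appears most often among an author's datasets. Two points are assumptions rather than documented behavior: "majority" is read here as "most frequent", and ties are broken alphabetically. The function name and the example field labels are hypothetical.

```python
from collections import Counter


def primary_field(dataset_fields):
    """Return the most frequent research field, or None if no field is known."""
    counts = Counter(f for f in dataset_fields if f)
    if not counts:
        return None
    top = max(counts.values())
    # Assumed tie-break: alphabetical order among the most frequent fields.
    return min(field for field, n in counts.items() if n == top)


fields = ["Biology", "Biology", "Physics", None]
print(primary_field(fields))
```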

S-index Calculation

Each author's S-index was calculated as the sum of the D-index scores of all datasets attributed to them, following the S-index formula. The auto-generated profiles and their S-index scores are browsable on Scholar Data's author search page.
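The summation can be illustrated with a short sketch. The D-index values and dataset ids below are made up for the example, and the function name is hypothetical. Counting each dataset once and skipping datasets without a D-index are assumptions, since the source does not say how such cases are handled.

```python
def s_index(d_index_by_dataset, author_dataset_ids):
    """Sum the D-index scores of the datasets attributed to one author."""
    # Assumption: each dataset counts once, even if listed twice.
    unique_ids = dict.fromkeys(author_dataset_ids)
    # Assumption: datasets with no D-index on record contribute nothing.
    return sum(
        d_index_by_dataset[ds_id]
        for ds_id in unique_ids
        if ds_id in d_index_by_dataset
    )


# Illustrative values only, not real scores.
d_scores = {"d1": 1.5, "d2": 2.0, "d3": 0.5}
print(s_index(d_scores, ["d1", "d2", "d2", "d9"]))
```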

Limitations of Auto-Generated Scores

Because author profiles were assembled automatically from metadata, without researcher input, the scores shown carry inherent limitations:

  • Author disambiguation is imperfect. Name/affiliation matching (66.4% of profiles) can conflate different researchers with similar names or split a single researcher's work across multiple profiles.
  • Dataset attribution may be incomplete. Datasets not registered in DataCite or EMDB, or with missing author metadata, are not reflected in auto-generated scores.
  • Scores are a point-in-time snapshot. They reflect citations, mentions, and FAIR scores as of January 2026 and will not update automatically.

Researchers who create a profile and manually claim their datasets will get a more accurate and up-to-date S-index that reflects their actual sharing footprint.

Documentation written with assistance from Claude by Anthropic.