Datasets
Scholar Data currently indexes millions of datasets from thousands of repositories worldwide. This page describes what's in the database, where it comes from, and what to do if your dataset isn't showing up.
What's in the Database
DataCite
DataCite is a global infrastructure that enables repositories and institutions to register DOIs for datasets and other research outputs. Because the DOI has become the primary persistent identifier for datasets, DataCite holds metadata for millions of datasets from thousands of repositories worldwide.
Scholar Data includes metadata for all DataCite-registered datasets as of September 30, 2025.
Electron Microscopy Data Bank (EMDB)
Scholar Data also includes datasets from the Electron Microscopy Data Bank (EMDB), an international repository for three-dimensional electron microscopy density maps. EMDB datasets are not assigned DOIs; each is instead identified by an accession number in the format EMD-XXXXX.
All EMDB datasets published as of September 30, 2025 are included.
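As an illustration of the accession format described above, the following minimal sketch checks whether a string looks like an EMDB accession number. It assumes each X stands for a decimal digit and also accepts four-digit numbers, since older EMDB entries use shorter accessions; the function name `is_emdb_accession` is hypothetical and not part of Scholar Data.

```python
import re

# Assumption: each "X" in EMD-XXXXX is a decimal digit. Four digits are also
# accepted because older EMDB entries use shorter accession numbers.
EMDB_ACCESSION = re.compile(r"EMD-\d{4,5}")


def is_emdb_accession(identifier: str) -> bool:
    """Return True if the identifier looks like an EMDB accession number."""
    # fullmatch rejects strings with extra characters before or after the ID.
    return EMDB_ACCESSION.fullmatch(identifier.strip()) is not None
```

A check like this could help distinguish an EMDB accession from a DOI when deciding how to look up a dataset. The match is case-sensitive, so a lowercase `emd-` prefix is rejected.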
Coverage Summary
| Source | Datasets | Repositories | Last Updated |
|---|---|---|---|
| DataCite | 49,009,522 | 17,306 | September 30, 2025 |
| EMDB | 51,645 | 1 | September 30, 2025 |
| Total | 49,061,167 | 17,307 | |
The database spans a large number of research fields, all repository types (generalist, domain-specific, and institutional), publication years from 1950 to 2026, and a wide variety of data types, licenses, and access levels.
Can't Find Your Dataset?
If your dataset doesn't appear when you search, it may not yet be in the Scholar Data database. This can happen if:
- It was deposited after September 30, 2025
- It is not registered with DataCite and is not in EMDB
In these cases, you can still evaluate the impact of your dataset from the Evaluate Datasets page, but you cannot yet add it to your profile (this is something we are working on).
Processing Notes
Metadata Processing
To keep the metadata manageable in subsequent pipeline steps, a reduced record was stored for each dataset, retaining only the fields relevant to S-index calculation:
- Identifier
- URL
- Title
- Publication year
- Creator details (name, affiliation, identifiers)
- Publisher
- Description
- Subjects / keywords
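The reduction to the fields listed above can be sketched as follows. This is only an illustration: the key names in `RETAINED_FIELDS` and the function `reduce_record` are hypothetical, since this page does not specify how the pipeline names or stores these fields.

```python
# Hypothetical key names for the eight retained fields; the actual keys used
# by the pipeline are not given on this page.
RETAINED_FIELDS = (
    "identifier",
    "url",
    "title",
    "publication_year",
    "creators",  # name, affiliation, identifiers
    "publisher",
    "description",
    "subjects",  # subjects / keywords
)


def reduce_record(record: dict) -> dict:
    """Return a new dict holding only the retained fields present in record."""
    return {key: record[key] for key in RETAINED_FIELDS if key in record}
```

Fields missing from the input are simply omitted from the output rather than filled with placeholders, which is a design choice of this sketch and not a documented behavior.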
Publication Year Correction
A small number of DataCite datasets contained likely erroneous publication years (for example, dates in the 1400s or future dates beyond January 2026), most likely due to metadata entry errors at the time of deposit. For any dataset with a publication year outside the range 1950–January 2026, the DOI creation date was used as a fallback publication date.
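The fallback rule above can be expressed as a short function. This is a minimal sketch under stated assumptions: the publication year is an integer, the DOI creation date is available as a `datetime.date`, and any year from 1950 through 2026 counts as valid because a bare year cannot be compared against the January 2026 cutoff. The function and constant names are hypothetical.

```python
from datetime import date

# Valid range taken from the text above. Assumption: with year-only
# metadata, 2026 itself is treated as inside the range.
MIN_YEAR = 1950
MAX_YEAR = 2026


def corrected_publication_year(publication_year: int, doi_created: date) -> int:
    """Return the publication year, or the DOI creation year if out of range."""
    if MIN_YEAR <= publication_year <= MAX_YEAR:
        return publication_year
    # Out-of-range years are assumed to be entry errors; fall back to the
    # year in which the DOI was created.
    return doi_created.year
```

For example, a record with a publication year of 1420 and a DOI created on May 3, 2019 would be assigned 2019, while a record with a publication year of 2015 would be left unchanged.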