API Access

Access dataset files directly from scripts, code, or AI agents.
Each file has a stable URL (r-link) that you can use directly in scripts, apps, or AI agents. These URLs are permanent and safe to hardcode.
Start with these files — they give you everything you need to understand and access the dataset.
1. Fetch datapackage.json to inspect schema and resources
2. Download the data resources listed in datapackage.json
3. Read README.md for full context (steps 1 and 2 are sketched in Python below)
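A minimal Python sketch of steps 1 and 2, using only the standard library. The base URL below is a placeholder, not the dataset's real r-link; substitute the actual URL from DataHub:

```python
import json
import urllib.request
from pathlib import Path

# Placeholder r-link base; substitute the dataset's actual URL from DataHub.
BASE_URL = "https://example.datahub.io/epoch-data-on-ai-models"

# 1. Fetch datapackage.json to inspect schema and resources.
with urllib.request.urlopen(f"{BASE_URL}/datapackage.json") as resp:
    package = json.load(resp)

# 2. Download each data resource listed in datapackage.json.
for resource in package["resources"]:
    path = resource["path"]  # e.g. "data/all-ai-models.csv"
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(f"{BASE_URL}/{path}", path)
    print(f"downloaded {path}")
```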
Data Views
All AI Models
Schema
| name | type | format | description |
|---|---|---|---|
| Model | string | | Name of the AI model |
| Domain | string | | Domain(s) the model operates in (e.g. Language, Vision) |
| Task | string | | Task(s) the model is designed for |
| Organization | string | | Organization(s) that developed the model |
| Authors | string | | Authors of the model or associated paper |
| Publication date | date | %Y-%m-%d | Date the model was published or released |
| Reference | string | | Citation reference for the model |
| Link | string | | URL to model paper or announcement |
| Citations | number | | Number of citations |
| Notability criteria | string | | Criteria that make this model notable |
| Notability criteria notes | string | | Additional notes on notability criteria |
| Parameters | number | | Number of model parameters |
| Parameters notes | string | | Notes on parameter count |
| Training compute (FLOP) | number | | Total training compute in floating point operations |
| Training compute notes | string | | Notes on training compute estimate |
| Training dataset | string | | Name or description of the training dataset |
| Training dataset notes | string | | Notes on the training dataset |
| Training dataset size (datapoints) | number | | Number of datapoints in the training dataset |
| Dataset size notes | string | | Notes on dataset size |
| Training time (hours) | number | | Total training time in hours |
| Training time notes | string | | Notes on training time estimate |
| Training hardware | string | | Hardware used for training (e.g. A100, H100) |
| Approach | string | | Modeling approach or architecture type |
| Confidence | string | | Confidence level of the data entries |
| Abstract | string | | Abstract of the associated paper |
| Epochs | number | | Number of training epochs |
| Benchmark data | string | | Benchmark evaluation data |
| Model accessibility | string | | Accessibility of the model weights (e.g. open, closed) |
| Country (of organization) | string | | Country where the developing organization is based |
| Base model | string | | Base model this model was fine-tuned from, if any |
| Finetune compute (FLOP) | number | | Compute used for fine-tuning in FLOP |
| Finetune compute notes | string | | Notes on fine-tune compute estimate |
| Hardware quantity | number | | Number of hardware units used for training |
| Hardware utilization (MFU) | number | | Model FLOP utilization (MFU) of training hardware |
| Last modified | string | | Timestamp when the record was last modified |
| Training cloud compute vendor | string | | Cloud vendor used for training compute |
| Training data center | string | | Data center used for training |
| Archived links | string | | Archived URLs for the model or paper |
| Batch size | number | | Training batch size |
| Batch size notes | string | | Notes on batch size |
| Organization categorization | string | | Category of the developing organization (e.g. Industry, Academia) |
| Foundation model | boolean | | Whether this is a foundation model |
| Training compute lower bound | number | | Lower bound estimate of training compute in FLOP |
| Training compute upper bound | number | | Upper bound estimate of training compute in FLOP |
| Training chip-hours | number | | Total chip-hours used for training |
| Training code accessibility | string | | Accessibility of training code |
| Accessibility notes | string | | Notes on accessibility of model or code |
| Organization categorization (from Organization) | string | | Organization category derived from organization field |
| Possibly over 1e23 FLOP | boolean | | Whether training compute may exceed 1e23 FLOP |
| Training compute cost (2023 USD) | number | | Estimated training compute cost in 2023 US dollars |
| Utilization notes | string | | Notes on hardware utilization |
| Numerical format | string | | Numerical precision format used in training (e.g. FP16, BF16) |
| Frontier model | boolean | | Whether this model was a frontier model at the time of release |
| Training power draw (W) | number | | Power consumption during training in watts |
| Training compute estimation method | string | | Method used to estimate training compute |
| Hugging Face developer id | string | | Hugging Face developer or organization identifier |
| Post-training compute (FLOP) | number | | Compute used for post-training (RLHF, fine-tuning, etc.) in FLOP |
| Post-training compute notes | string | | Notes on post-training compute estimate |
| Hardware utilization (HFU) | number | | Hardware FLOP utilization (HFU) during training |
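As a quick illustration, here is a hedged pandas sketch that loads this resource and applies the %Y-%m-%d date format declared above. The path data/all-ai-models.csv is an assumption based on the download sketch earlier; check datapackage.json for the actual resource paths:

```python
import pandas as pd

# Assumes the resource was downloaded to data/ as in the earlier sketch.
models = pd.read_csv("data/all-ai-models.csv")

# "Publication date" is declared as a date with format %Y-%m-%d in the schema.
models["Publication date"] = pd.to_datetime(
    models["Publication date"], format="%Y-%m-%d", errors="coerce"
)

print(models[["Model", "Organization", "Publication date",
              "Training compute (FLOP)"]].head())
```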
Notable AI Models
Schema
| name | type | format | description |
|---|---|---|---|
| Model | string | | Name of the AI model |
| Organization | string | | Organization(s) that developed the model |
| Publication date | date | %Y-%m-%d | Date the model was published or released |
| Domain | string | | Domain(s) the model operates in (e.g. Language, Vision) |
| Task | string | | Task(s) the model is designed for |
| Parameters | number | | Number of model parameters |
| Parameters notes | string | | Notes on parameter count |
| Training compute (FLOP) | number | | Total training compute in floating point operations |
| Training compute notes | string | | Notes on training compute estimate |
| Training dataset | string | | Name or description of the training dataset |
| Training dataset size (datapoints) | number | | Number of datapoints in the training dataset |
| Dataset size notes | string | | Notes on dataset size |
| Confidence | string | | Confidence level of the data entries |
| Link | string | | URL to model paper or announcement |
| Reference | string | | Citation reference for the model |
| Citations | number | | Number of citations |
| Authors | string | | Authors of the model or associated paper |
| Abstract | string | | Abstract of the associated paper |
| Organization categorization | string | | Category of the developing organization (e.g. Industry, Academia) |
| Country (of organization) | string | | Country where the developing organization is based |
| Notability criteria | string | | Criteria that make this model notable |
| Notability criteria notes | string | | Additional notes on notability criteria |
| Epochs | number | | Number of training epochs |
| Training time (hours) | number | | Total training time in hours |
| Training time notes | string | | Notes on training time estimate |
| Training hardware | string | | Hardware used for training (e.g. A100, H100) |
| Hardware quantity | number | | Number of hardware units used for training |
| Hardware utilization (MFU) | number | | Model FLOP utilization (MFU) of training hardware |
| Training compute cost (2023 USD) | number | | Estimated training compute cost in 2023 US dollars |
| Compute cost notes | string | | Notes on compute cost estimate |
| Training power draw (W) | number | | Power consumption during training in watts |
| Base model | string | | Base model this model was fine-tuned from, if any |
| Finetune compute (FLOP) | number | | Compute used for fine-tuning in FLOP |
| Finetune compute notes | string | | Notes on fine-tune compute estimate |
| Batch size | number | | Training batch size |
| Batch size notes | string | | Notes on batch size |
| Model accessibility | string | | Accessibility of the model weights (e.g. open, closed) |
| Training code accessibility | string | | Accessibility of training code |
| Inference code accessibility | string | | Accessibility of inference code |
| Accessibility notes | string | | Notes on accessibility of model or code |
| Numerical format | string | | Numerical precision format used in training (e.g. FP16, BF16) |
| Frontier model | boolean | | Whether this model was a frontier model at the time of release |
| Hardware acquisition cost | number | | Cost of acquiring the training hardware in USD |
| Hardware utilization (HFU) | number | | Hardware FLOP utilization (HFU) during training |
| Training compute cost (cloud) | number | | Estimated training compute cost using cloud pricing in USD |
| Training compute cost (upfront) | number | | Estimated training compute cost using upfront hardware pricing in USD |
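The numeric columns make simple filtering straightforward. A hedged sketch (the CSV path is again an assumed download location) that pulls out notable models whose estimated training compute exceeds 1e25 FLOP:

```python
import pandas as pd

notable = pd.read_csv("data/notable-ai-models.csv")  # assumed download path

# Models whose estimated training compute exceeds 1e25 FLOP, newest first.
# Rows with a missing estimate (NaN) are dropped by the comparison.
big = notable[notable["Training compute (FLOP)"] > 1e25]
big = big.sort_values("Publication date", ascending=False)

print(big[["Model", "Organization", "Publication date",
           "Training compute (FLOP)"]].to_string(index=False))
```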
Large-Scale AI Models
Schema
| name | type | format | description |
|---|---|---|---|
| Model | string | | Name of the AI model |
| Domain | string | | Domain(s) the model operates in (e.g. Language, Vision) |
| Task | string | | Task(s) the model is designed for |
| Authors | string | | Authors of the model or associated paper |
| Model accessibility | string | | Accessibility of the model weights (e.g. open, closed) |
| Link | string | | URL to model paper or announcement |
| Citations | number | | Number of citations |
| Reference | string | | Citation reference for the model |
| Publication date | date | %Y-%m-%d | Date the model was published or released |
| Organization | string | | Organization(s) that developed the model |
| Parameters | number | | Number of model parameters |
| Parameters notes | string | | Notes on parameter count |
| Training compute (FLOP) | number | | Total training compute in floating point operations |
| Training compute notes | string | | Notes on training compute estimate |
| Training dataset | string | | Name or description of the training dataset |
| Training dataset notes | string | | Notes on the training dataset |
| Training dataset size (datapoints) | number | | Number of datapoints in the training dataset |
| Dataset size notes | string | | Notes on dataset size |
| Training time (hours) | number | | Total training time in hours |
| Training time notes | string | | Notes on training time estimate |
| Training hardware | string | | Hardware used for training (e.g. A100, H100) |
| Confidence | string | | Confidence level of the data entries |
| Abstract | string | | Abstract of the associated paper |
| Country (of organization) | string | | Country where the developing organization is based |
| Base model | string | | Base model this model was fine-tuned from, if any |
| Finetune compute (FLOP) | number | | Compute used for fine-tuning in FLOP |
| Finetune compute notes | string | | Notes on fine-tune compute estimate |
| Hardware quantity | number | | Number of hardware units used for training |
| Hardware utilization (MFU) | number | | Model FLOP utilization (MFU) of training hardware |
| Training code accessibility | string | | Accessibility of training code |
| Accessibility notes | string | | Notes on accessibility of model or code |
| Organization categorization (from Organization) | string | | Organization category derived from organization field |
| Hardware utilization (HFU) | number | | Hardware FLOP utilization (HFU) during training |
| Training compute cost (cloud) | number | | Estimated training compute cost using cloud pricing in USD |
| Training compute cost (upfront) | number | | Estimated training compute cost using upfront hardware pricing in USD |
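Since training compute spans many orders of magnitude, a log-scale scatter of compute against publication date is a natural first look at this subset. A hedged matplotlib sketch, again assuming the file was downloaded to data/:

```python
import pandas as pd
import matplotlib.pyplot as plt

large = pd.read_csv("data/large-scale-ai-models.csv")  # assumed download path
large["Publication date"] = pd.to_datetime(large["Publication date"], errors="coerce")

# Training compute spans many orders of magnitude, so plot on a log scale.
plt.scatter(large["Publication date"], large["Training compute (FLOP)"], s=10)
plt.yscale("log")
plt.xlabel("Publication date")
plt.ylabel("Training compute (FLOP)")
plt.title("Large-scale AI models: training compute over time")
plt.tight_layout()
plt.show()
```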
Frontier AI Models
Schema
| name | type | format | description |
|---|---|---|---|
| Model | string | | Name of the AI model |
| Domain | string | | Domain(s) the model operates in (e.g. Language, Vision) |
| Task | string | | Task(s) the model is designed for |
| Authors | string | | Authors of the model or associated paper |
| Notability criteria | string | | Criteria that make this model notable |
| Notability criteria notes | string | | Additional notes on notability criteria |
| Model accessibility | string | | Accessibility of the model weights (e.g. open, closed) |
| Link | string | | URL to model paper or announcement |
| Citations | number | | Number of citations |
| Reference | string | | Citation reference for the model |
| Publication date | date | %Y-%m-%d | Date the model was published or released |
| Organization | string | | Organization(s) that developed the model |
| Parameters | number | | Number of model parameters |
| Parameters notes | string | | Notes on parameter count |
| Training compute (FLOP) | number | | Total training compute in floating point operations |
| Training compute notes | string | | Notes on training compute estimate |
| Training dataset | string | | Name or description of the training dataset |
| Training dataset notes | string | | Notes on the training dataset |
| Training dataset size (datapoints) | number | | Number of datapoints in the training dataset |
| Dataset size notes | string | | Notes on dataset size |
| Epochs | number | | Number of training epochs |
| Inference compute (FLOP) | number | | Compute per inference pass in FLOP |
| Inference compute notes | string | | Notes on inference compute estimate |
| Training time (hours) | number | | Total training time in hours |
| Training time notes | string | | Notes on training time estimate |
| Training hardware | string | | Hardware used for training (e.g. A100, H100) |
| Approach | string | | Modeling approach or architecture type |
| Compute cost notes | string | | Notes on compute cost estimate |
| Compute sponsor categorization | string | | Category of the compute sponsor |
| Confidence | string | | Confidence level of the data entries |
| Abstract | string | | Abstract of the associated paper |
| Last modified | string | | Timestamp when the record was last modified |
| Created By | string | | Person who created this record |
| Benchmark data | string | | Benchmark evaluation data |
| Exclude | boolean | | Whether this model is excluded from certain analyses |
| Country (of organization) | string | | Country where the developing organization is based |
| Base model | string | | Base model this model was fine-tuned from, if any |
| Finetune compute (FLOP) | number | | Compute used for fine-tuning in FLOP |
| Finetune compute notes | string | | Notes on fine-tune compute estimate |
| Hardware quantity | number | | Number of hardware units used for training |
| Hardware utilization (MFU) | number | | Model FLOP utilization (MFU) of training hardware |
| Training cost trends | string | | Trend information for training costs |
| Training cloud compute vendor | string | | Cloud vendor used for training compute |
| Training data center | string | | Data center used for training |
| Archived links | string | | Archived URLs for the model or paper |
| Batch size | number | | Training batch size |
| Batch size notes | string | | Notes on batch size |
| Organization categorization | string | | Category of the developing organization (e.g. Industry, Academia) |
| Foundation model | boolean | | Whether this is a foundation model |
| Training compute lower bound | number | | Lower bound estimate of training compute in FLOP |
| Training compute upper bound | number | | Upper bound estimate of training compute in FLOP |
| Training chip-hours | number | | Total chip-hours used for training |
| Training code accessibility | string | | Accessibility of training code |
| Accessibility notes | string | | Notes on accessibility of model or code |
| Organization categorization (from Organization) | string | | Organization category derived from organization field |
| Possibly over 1e23 FLOP | boolean | | Whether training compute may exceed 1e23 FLOP |
| Training compute cost (2023 USD) | number | | Estimated training compute cost in 2023 US dollars |
| Training dataset size | number | | Size of the training dataset (alternative field) |
| Sparsity | number | | Model sparsity ratio |
| Utilization notes | string | | Notes on hardware utilization |
| Estimated over 1e25 FLOP | boolean | | Whether training compute is estimated to exceed 1e25 FLOP |
| Power per GPU | number | | Power draw per GPU unit in watts |
| Cluster total TDP | number | | Total thermal design power of the training cluster in watts |
| Base model compute | number | | Training compute of the base model in FLOP |
| Total compute - (base + finetune) | number | | Total compute including base model and fine-tuning in FLOP |
| API prices | string | | API pricing information for the model |
| Created | string | | Timestamp when the record was created |
| Inference code accessibility | string | | Accessibility of inference code |
| Numerical format | string | | Numerical precision format used in training (e.g. FP16, BF16) |
| Model versions | string | | Available versions of the model |
| Frontier model | boolean | | Whether this model was a frontier model at the time of release |
| Training power draw (W) | number | | Power consumption during training in watts |
| Benchmark evals | string | | Benchmark evaluation results |
| FLOP/$ | number | | Training compute efficiency in FLOP per dollar |
| Hardware release date | date | any | Release date of the training hardware |
| Hardware age | number | | Age of the training hardware in years at time of training |
| Hardware FP32 | number | | Hardware FP32 FLOP/s throughput |
| Hardware TF32 | number | | Hardware TF32 FLOP/s throughput |
| Hardware count | number | | Number of hardware units in the training cluster |
| Hardware TF16 | number | | Hardware TF16 FLOP/s throughput |
| Hardware FP16 | number | | Hardware FP16 FLOP/s throughput |
| Assumed precision | string | | Assumed numerical precision for compute estimates |
| Assumed hardware FLOP/s | number | | Assumed hardware throughput in FLOP/s used for compute estimates |
| Hardware type | string | | Type of hardware used (e.g. GPU, TPU) |
| Training compute estimation method | string | | Method used to estimate training compute |
| Biological model safeguards | string | | Safeguards related to biological model risks |
| BenchmarkHub-v1 | string | | BenchmarkHub v1 evaluation results |
| Hugging Face developer id | string | | Hugging Face developer or organization identifier |
| Post-training compute (FLOP) | number | | Compute used for post-training (RLHF, fine-tuning, etc.) in FLOP |
| Post-training compute notes | string | | Notes on post-training compute estimate |
| Hardware maker | string | | Manufacturer of the training hardware |
| benchmarks/models | string | | Benchmark to model mapping data |
| Maybe over 1e25 FLOP | boolean | | Whether training compute may exceed 1e25 FLOP |
| Updated dataset size | number | | Updated or revised training dataset size |
| WT103 ppl | number | | WikiText-103 perplexity score |
| WT2 ppl | number | | WikiText-2 perplexity score |
| PTB ppl | number | | Penn Treebank perplexity score |
| Distillation or synthetic data | string | | Whether model was trained on distillation or synthetic data |
| Distillation or synthetic data compute | number | | Compute used to generate distillation or synthetic training data in FLOP |
| Distillation or synthetic data compute notes | string | | Notes on distillation or synthetic data compute |
| Knowledge cutoff | string | | Training data knowledge cutoff date |
| Context window | number | | Maximum context window size in tokens |
| Hardware utilization (HFU) | number | | Hardware FLOP utilization (HFU) during training |
| Training compute cost (cloud) | number | | Estimated training compute cost using cloud pricing in USD |
| Training compute cost (upfront) | number | | Estimated training compute cost using upfront hardware pricing in USD |
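The Base model column links fine-tuned models to the model they were derived from, which allows a self-join to trace lineage. A hedged sketch (the CSV path is an assumed download location, and the suffixed column names are produced by pandas, not by the dataset):

```python
import pandas as pd

frontier = pd.read_csv("data/frontier-ai-models.csv")  # assumed download path

# Self-join on "Base model" to attach each fine-tune to its base model's row.
# Pandas renames the overlapping right-hand columns with the " (base)" suffix.
lineage = frontier.merge(
    frontier[["Model", "Training compute (FLOP)"]],
    left_on="Base model",
    right_on="Model",
    suffixes=("", " (base)"),
)

print(lineage[["Model", "Base model",
               "Training compute (FLOP)", "Training compute (FLOP) (base)"]].head())
```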
Data Files
| File | Description | Size | Last modified | Download |
|---|---|---|---|---|
| all-ai-models | All AI models in the Epoch database (~21,000 entries). | 5.72 MB | about 1 month ago | all-ai-models |
| notable-ai-models | Subset of notable AI models with richer metadata (~7,400 entries). | 1.85 MB | about 1 month ago | notable-ai-models |
| large-scale-ai-models | Large-scale AI models subset (~3,600 entries). | 902 kB | about 1 month ago | large-scale-ai-models |
| frontier-ai-models | Frontier AI models subset — the most capable models at each point in time (~1,600 entries). | 371 kB | about 1 month ago | frontier-ai-models |
| Files | Size | Format | Created | Updated | License | Source |
|---|---|---|---|---|---|---|
| 4 | 8.85 MB | | about 2 months ago | | Creative Commons Attribution 4.0 | Epoch AI — Notable AI Models |
Dataset: epoch-data-on-ai-models
This is a Frictionless Data Package.
Concepts
Data hierarchy (from broad to specific):
- Catalog = a collection of datasets (maps to a DataHub publication, one GitHub repo)
- Dataset = a coherent data concept with a defined schema and coverage — this directory
- Data file = a concrete file artifact (csv, json, parquet…) listed as a resource in datapackage.json
Dataset lifecycle — a dataset doesn't need to be complete on day one:
- capture — just a URL or note, intent to explore
- stub — minimal entry: title, description, source link, no files yet
- archived — raw files downloaded locally
- structured — cleaned, normalised, schema documented
- enriched — analysis, visualisations, derived data added
- monitored — living source, versioned and updated over time
Catalog-as-repo pattern: if the source is a portal or collection containing many datasets (e.g. a data.gov agency, an institutional archive), give it its own repo and DataHub publication — not a subfolder here.
Structure
```
epoch-data-on-ai-models/
  datapackage.json    # dataset metadata and resource list
  data/               # data files (csv, json, parquet, etc.)
  .datahubignore      # files to exclude when pushing (gitignore syntax)
```
datapackage.json
Keep resources in sync with what's in data/:
```json
{
  "name": "epoch-data-on-ai-models",
  "title": "Human readable title",
  "description": "What this dataset is about",
  "resources": [
    {
      "path": "data/my-file.csv",
      "name": "my-file",
      "mediatype": "text/csv"
    }
  ]
}
```
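If you want to check the descriptor from Python rather than the CLI, the frictionless-py library can validate it. A minimal sketch, assuming a recent (v5) frictionless installation:

```python
# Requires: pip install frictionless
from frictionless import validate

report = validate("datapackage.json")
print("valid" if report.valid else report.flatten(["type", "message"]))
```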
Workflow
```bash
# Add data files to data/
# Edit datapackage.json — update resources to list them
data pack .   # validate
dh push .     # publish to DataHub
```
Key rules
- Every file in `data/` that you want published must be listed in `resources`
- `name` in datapackage.json must be URL-safe (lowercase, hyphens)
- Use `.datahubignore` to exclude scratch files, large intermediaries, etc.
- It is fine to push a stub — set lifecycle stage in datapackage.json as `"status": "stub"` if incomplete
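Since `.datahubignore` uses gitignore syntax, a hypothetical example might look like the following; every pattern here is purely illustrative:

```
# .datahubignore: hypothetical example, patterns are illustrative
scratch/
notebooks/
*.tmp
data/*-raw.csv
```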