Reference Datasets by Domain
| Domain | Dataset | Size | Task |
|---|---|---|---|
| Vision | ImageNet-1k | 1.2M images | 1000-class classification |
| Vision | COCO | 330k images | Detection + segmentation |
| Vision | LAION-5B | 5B (image, text) pairs | Text-to-image |
| Vision | ADE20K | 27k images | Scene segmentation |
| NLP | GLUE / SuperGLUE | ≈ 10 tasks | Benchmark NLP |
| NLP | SQuAD 2.0 | 150k questions | Question answering |
| NLP | Common Crawl | 500B+ tokens | Pre-training LLM |
| NLP | The Pile (EleutherAI) | 825 GB | Pre-training (diverse sources) |
| NLP | C4 (Google) | 750 GB | Pre-training T5 |
| NLP | FineWeb (HF) | 15T tokens | Pre-training LLM (2024) |
| NLP | Dolma (AI2) | 3T tokens | Pre-training OLMo |
| Multi | YouTube-8M | 8M videos | Video classification |
| Multi | Kinetics-700 | 700k videos | Action recognition |
| RL | Atari 2600 | 57 games | Deep RL benchmark |
| RL | DMC (DeepMind Control) | 30 tasks | Continuous control |
| GNN | OGB (Open Graph Benchmark) | 7 datasets | Node/graph/link prediction |
| Audio | LibriSpeech | 1000 hours | ASR |
| Audio | AudioSet | 2M clips | Audio event detection |
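
The Size column mixes units for the text corpora: some rows are given in GB and others in tokens. To compare them on one scale, a rough conversion can help. The sketch below assumes about 4 bytes of text per token, which is only a rule of thumb for BPE-style tokenizers on English text; the constant `BYTES_PER_TOKEN` and both helper functions are hypothetical names introduced for illustration, and the results are order-of-magnitude estimates, not published figures.

```python
# Rough unit conversion for comparing text corpora listed in GB vs. tokens.
# ASSUMPTION: ~4 bytes of UTF-8 text per token. The real ratio depends on
# the tokenizer, the language mix, and whether sizes are raw or compressed.
BYTES_PER_TOKEN = 4.0  # hypothetical default, not taken from the table


def gb_to_tokens(size_gb: float, bytes_per_token: float = BYTES_PER_TOKEN) -> float:
    """Estimate a token count from a corpus size in gigabytes (10^9 bytes)."""
    return size_gb * 1e9 / bytes_per_token


def tokens_to_gb(n_tokens: float, bytes_per_token: float = BYTES_PER_TOKEN) -> float:
    """Estimate a corpus size in gigabytes (10^9 bytes) from a token count."""
    return n_tokens * bytes_per_token / 1e9


# Sizes copied from the table above.
corpora_gb = {"The Pile": 825, "C4": 750}
corpora_tokens = {"FineWeb": 15e12, "Dolma": 3e12}

for name, gb in corpora_gb.items():
    print(f"{name}: roughly {gb_to_tokens(gb) / 1e9:.0f}B tokens")

for name, tokens in corpora_tokens.items():
    print(f"{name}: roughly {tokens_to_gb(tokens) / 1e3:.0f} TB of text")
```

Under this assumption the GB-sized corpora land in the low hundreds of billions of tokens, while the token-sized ones correspond to tens of terabytes of text, so the two groups differ by well over an order of magnitude. Passing a different `bytes_per_token` shows how sensitive the estimate is to the tokenizer.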