Foundational Datasets
| Dataset | Year | Task | Size |
|---|---|---|---|
| MNIST | 1998 | Handwritten digit recognition | 70k images |
| CIFAR-10/100 | 2009 | Object classification (10 / 100 classes) | 60k images each |
| ImageNet | 2009 | Image classification (1,000 classes in ILSVRC) | 1.2M training images |
| MS COCO | 2014 | Object detection, segmentation, captioning | 330k images |
| SQuAD | 2016 | Question answering | 100k questions |
| GLUE / SuperGLUE | 2018 / 2019 | Multi-task NLP benchmarks | 9 / 8 tasks |
| Common Crawl | ongoing | Raw web-crawl text corpus used for LLM pretraining | 500B+ tokens |
| RedPajama | 2023 | Open-source reproduction of the LLaMA training data (largely Common Crawl) | 1.2T tokens |
| FineWeb | 2024 | Filtered, deduplicated Common Crawl text for LLM pretraining (Hugging Face) | 15T tokens |