Foundational Datasets

Dataset	Year	Task	Size
MNIST	1998	Handwritten digit recognition	70k images
CIFAR-10/100	2009	Object classification	60k images each
ImageNet	2009	Classification, 1000 classes	1.2M images
MS COCO	2014	Detection + segmentation + captioning	330k images
SQuAD	2016	Question answering	100k questions
GLUE / SuperGLUE	2018/2019	Multi-task NLP benchmark
Common Crawl	—	Text corpus for LLMs	500B+ tokens
RedPajama	2023	Open-source reproduction of the LLaMA training data (CC)	1.2T tokens
FineWeb	2024	Openly licensed web dataset for LLM pretraining, derived from Common Crawl (Hugging Face)	15T tokens
