Commonly Used Datasets in Machine Learning
Computer Vision
MNIST
This is often the first dataset one encounters when studying machine learning. Pronounced [em-nist], it consists of grayscale images of handwritten digits, each of size $28\times 28$. The dataset includes 60,000 training images and 10,000 test images[^1].
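To make these numbers concrete, the following is a minimal sketch that works out how many pixel values a single image holds and how large the raw, uncompressed splits would be. It assumes one byte per pixel (8-bit grayscale); the helper names `pixels_per_image` and `raw_bytes` are made up for this illustration and do not belong to any library.

```python
# Sizes taken from the description above.
IMAGE_SIDE = 28
TRAIN_IMAGES = 60_000
TEST_IMAGES = 10_000


def pixels_per_image(side: int, channels: int = 1) -> int:
    """Number of stored values in one square image."""
    return side * side * channels


def raw_bytes(num_images: int, side: int, channels: int = 1) -> int:
    """Uncompressed size in bytes, assuming one byte (uint8) per value."""
    return num_images * pixels_per_image(side, channels)


if __name__ == "__main__":
    print(pixels_per_image(IMAGE_SIDE))         # 784
    print(raw_bytes(TRAIN_IMAGES, IMAGE_SIDE))  # 47040000
    print(raw_bytes(TEST_IMAGES, IMAGE_SIDE))   # 7840000
```

At roughly 47 MB for the training images, the whole dataset fits comfortably in memory, which is part of why it works well as a first example.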
CIFAR-10, CIFAR-100
CIFAR-10, pronounced [see-far-ten], contains 60,000 color images of size $32\times 32$ in 10 categories. It is split into 50,000 training images and 10,000 test images.
CIFAR-100 is similar to CIFAR-10 in that it includes 60,000 color images of size $32\times 32$, but it has 100 categories. Each category contains 600 images, with 500 for training and 100 for testing.
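The split sizes of both datasets follow directly from the per-category counts. The short sketch below checks that arithmetic. The function name `split_sizes` is hypothetical, and the per-category figures for CIFAR-10 (5,000 training and 1,000 test images) are obtained by dividing the totals above by 10.

```python
def split_sizes(num_classes: int, train_per_class: int, test_per_class: int):
    """Return (train_total, test_total, overall_total) for a balanced dataset."""
    train_total = num_classes * train_per_class
    test_total = num_classes * test_per_class
    return train_total, test_total, train_total + test_total


if __name__ == "__main__":
    # CIFAR-10: 10 categories, 5,000 training and 1,000 test images each.
    print(split_sizes(10, 5_000, 1_000))  # (50000, 10000, 60000)
    # CIFAR-100: 100 categories, 500 training and 100 test images each.
    print(split_sizes(100, 500, 100))     # (50000, 10000, 60000)
```

Both datasets therefore have the same overall size; CIFAR-100 simply spreads the same number of images over ten times as many categories, leaving far fewer examples per category.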
ImageNet
The full dataset contains over 14 million images in more than 20,000 categories. In practice, most work uses the ILSVRC subset (often called ImageNet-1k), which covers 1,000 categories and about 1.2 million training images.
SVHN (The Street View House Numbers)
This dataset consists of photos of house numbers that Google collected while developing the Street View feature of Google Maps. In its cropped-digit format, each sample is a color image of size $32\times 32$.
Classification, Clustering
Iris Dataset
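This dataset records measurements of iris flowers collected by botanist Edgar Anderson and made famous by statistician Ronald Fisher. It contains 150 samples from three species, each described by four features: sepal length, sepal width, petal length, and petal width. It is widely used as an introductory example in statistics and machine learning.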