Commonly Used Datasets in Machine Learning

Computer Vision

MNIST

MNIST is often the first dataset one encounters when studying machine learning. Pronounced [em-nist], it consists of grayscale images of handwritten digits (0 through 9), each of size $28\times 28$. The dataset includes 60,000 training images and 10,000 test images[^1].

MnistExamples.png
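To make these sizes concrete, the following sketch works out the shape arithmetic for MNIST using only the Python standard library. It assumes one byte per pixel (8-bit grayscale), and the helper name `flattened_size` is hypothetical, introduced here only for illustration.

```python
# Shape arithmetic for MNIST, based on the figures quoted above.
IMAGE_HEIGHT = 28
IMAGE_WIDTH = 28
NUM_TRAIN = 60_000
NUM_TEST = 10_000


def flattened_size(height: int, width: int, channels: int = 1) -> int:
    """Number of values in one image after flattening it into a vector."""
    return height * width * channels


features = flattened_size(IMAGE_HEIGHT, IMAGE_WIDTH)  # 28 * 28 values
total_images = NUM_TRAIN + NUM_TEST

# Assumption: one byte per pixel (8-bit grayscale).
train_bytes = NUM_TRAIN * features

print(f"features per image: {features}")
print(f"total images: {total_images}")
print(f"raw training pixels: {train_bytes / 1e6:.2f} MB")
```

A model that takes flattened MNIST images as input therefore sees a vector of `features` values per sample, and the raw training pixels fit comfortably in memory.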

CIFAR-10, CIFAR-100

CIFAR-10, pronounced [see-far-ten], contains 60,000 color images of size $32\times 32$ in 10 categories. It is split into 50,000 training images and 10,000 test images.

Screenshot2020-09-28at14.55.24.png

CIFAR-100 resembles CIFAR-10 in that it also contains 60,000 color images of size $32\times 32$, but they are divided into 100 categories. Each category contains 600 images: 500 for training and 100 for testing.

Screenshot2020-09-28at15.02.19.png
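The split sizes of the two CIFAR datasets can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the classes are balanced in both datasets (so CIFAR-10 has 5,000 training and 1,000 test images per class); `SplitSpec` is a hypothetical container defined only for this example and does not load any data.

```python
# Consistency check for the CIFAR-10 / CIFAR-100 figures quoted above.
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitSpec:
    """Hypothetical description of a class-balanced train/test split."""

    num_classes: int
    train_per_class: int
    test_per_class: int

    @property
    def num_train(self) -> int:
        return self.num_classes * self.train_per_class

    @property
    def num_test(self) -> int:
        return self.num_classes * self.test_per_class

    @property
    def total(self) -> int:
        return self.num_train + self.num_test


# Assumption: every class has the same number of images.
cifar10 = SplitSpec(num_classes=10, train_per_class=5_000, test_per_class=1_000)
cifar100 = SplitSpec(num_classes=100, train_per_class=500, test_per_class=100)

# Each image is 32 x 32 pixels with 3 color channels.
values_per_image = 32 * 32 * 3

for name, spec in (("CIFAR-10", cifar10), ("CIFAR-100", cifar100)):
    print(f"{name}: train={spec.num_train}, test={spec.num_test}, total={spec.total}")
print(f"values per image: {values_per_image}")
```

Both datasets come out to the same totals; the difference is that CIFAR-100 spreads the images over ten times as many categories, leaving far fewer training examples per class.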

ImageNet

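This dataset contains more than 14 million images across over 20,000 categories, with several hundred images per category on average. In practice, most work uses the subset from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which provides about 1.2 million training images in 1,000 categories.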

Screenshot2020-09-28at15.04.23.png

SVHN (The Street View House Numbers)

This dataset consists of photos of house numbers taken from Google Street View imagery. In its cropped-digit format, it is composed of color images of size $32\times 32$, each centered on a single digit.

Screenshot2020-09-28at15.16.33.png

Classification, Clustering

Iris Dataset

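This dataset records measurements of iris flowers. The data were collected by botanist Edgar Anderson and popularized by statistician Ronald Fisher. It contains 150 samples from three species, each described by four measurements (sepal length, sepal width, petal length, and petal width), and it is widely used as an introductory example in statistics and machine learning.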

iris.png