MNIST Database
Overview [1]
$$ \includegraphics[height=20em]{https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png} $$
The MNIST database is a dataset of handwritten digits written by American high school students and Census Bureau employees. It is commonly referred to simply as MNIST.
Description
This dataset is frequently used as an introductory example in machine learning and deep learning. NIST originally collected handwriting samples, in the form shown below, to evaluate character recognition technology for the automated sorting of handwritten postal codes. Yann LeCun took this handwriting data from high school students and Census Bureau employees, processed it, and created MNIST. Each image is 28 × 28 pixels, and the dataset consists of 60,000 training images and 10,000 test images.
$$ \includegraphics[height=30em]{https://www.nist.gov/sites/default/files/styles/960_x_960_limit/public/images/2019/04/27/sd19.jpg?itok=oETq77cZ} $$
How to Use
Julia
In Julia, the MNIST dataset is available through the machine learning dataset package MLDatasets.jl. By default, it loads the training set with Float32 features. The element type and the split can be changed through arguments, as in MNIST(Float64, :test). The available methods are as follows:
- dataset[i]: Returns a tuple of the features and target of the i-th observation.
- dataset[:]: Returns a tuple of the features and targets of all observations.
- length(dataset): Returns the number of observations.
- convert2image(dataset, i): Converts the i-th observation into a grayscale image. The ImageShow.jl package is required.
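The indexing methods listed above can be sketched as follows. This is a minimal example that assumes MLDatasets.jl is installed and that the data can be downloaded on first use; the variable names are illustrative only.

```julia
using MLDatasets

# On first use the files are downloaded. DataDeps.jl asks for confirmation
# unless ENV["DATADEPS_ALWAYS_ACCEPT"] = "true" is set beforehand.
train = MNIST()            # defaults: Float32 features, training split

x1, y1 = train[1]          # features and target of the first observation
xs, ys = train[1:3]        # a range of indices returns a small batch
X, Y = train[:]            # the whole split

println(size(x1), " ", y1)
println(size(xs), " ", ys)
println(size(X), " ", length(Y))
```

Destructuring with `x, y = ...` is used here so the sketch does not depend on the field names of the returned tuple.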
julia> using MLDatasets
julia> train = MNIST()
dataset MNIST:
  metadata  =>  Dict{String, Any} with 3 entries
  split     =>  :train
  features  =>  28×28×60000 Array{Float32, 3}
  targets   =>  60000-element Vector{Int64}
julia> test = MNIST(Float64, :test)
dataset MNIST:
  metadata  =>  Dict{String, Any} with 3 entries
  split     =>  :test
  features  =>  28×28×10000 Array{Float64, 3}
  targets   =>  10000-element Vector{Int64}
julia> length(train), length(test)
(60000, 10000)
julia> using Plots
julia> using ImageShow
julia> train.targets[1]
5
julia> heatmap(convert2image(train, 1))
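As the summaries above show, the features are stored as a 28×28×N array with one image per slice along the third dimension. The following sketch illustrates that layout and one common way to flatten each image into a column vector. It uses a random stand-in array (an assumption, so that it runs without downloading the data); with the real dataset, `train.features` would take its place.

```julia
# Stand-in for train.features: same 28×28×N layout, random values.
# This is a hypothetical substitute used only to keep the sketch self-contained.
features = rand(Float32, 28, 28, 100)

first_image = features[:, :, 1]      # a single 28×28 image
batch = features[:, :, 1:10]         # ten images, still 28×28×10

# Flatten every image into one column, giving a 784×N matrix.
flat = reshape(features, 28 * 28, :)

println(size(first_image))
println(size(batch))
println(size(flat))
```

Because Julia arrays are column-major, column i of `flat` holds exactly the entries of image i in the order given by `vec(features[:, :, i])`, and `reshape` does not copy the data.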
Since the labels are stored as integers, one-hot encoding must be applied separately, for example with Flux.onehotbatch.
julia> train.targets[1:5]
5-element Vector{Int64}:
5
0
4
1
9
julia> using Flux
julia> Flux.onehotbatch(train.targets[1:5], 0:9)
10×5 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
⋅ 1 ⋅ ⋅ ⋅
⋅ ⋅ ⋅ 1 ⋅
⋅ ⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅ ⋅
⋅ ⋅ 1 ⋅ ⋅
1 ⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅ 1
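To make the encoding above concrete, here is a minimal sketch of the same idea using only Base Julia. It is not how Flux implements `onehotbatch`; it only reproduces the resulting pattern as a dense Boolean matrix, and the names `labels`, `classes`, `onehot`, and `decoded` are illustrative.

```julia
# The first five training labels from the example above, and the ten digit classes.
labels = [5, 0, 4, 1, 9]
classes = 0:9

# Entry (r, c) is true exactly when classes[r] == labels[c].
onehot = classes .== permutedims(labels)    # 10×5 BitMatrix

# Decoding: find the position of the true entry in each column
# and map it back to the corresponding class.
decoded = [classes[argmax(col)] for col in eachcol(onehot)]

println(size(onehot))   # (10, 5)
println(decoded)        # [5, 0, 4, 1, 9]
```

Each column contains a single true entry, in row 6 for the label 5, row 1 for the label 0, and so on, which matches the matrix printed by Flux. In Flux itself, `Flux.onecold(y, 0:9)` performs the decoding step.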
Environment
- OS: Windows 11
- Version: Julia v1.8.2, MLDatasets v0.7.6, Plots v1.36.1, ImageShow v0.3.6, Flux v0.13.7
[1] Gun-Woo Kwon and Ryeong Heo, Learning AI through Night History and Comics 2, p. 68