Iris Dataset

Overview¹

The Iris dataset is a collection of measurements of iris flowers, recorded by the American botanist Edgar Anderson and introduced by the British statistician Ronald Fisher².

Description

It is one of the most commonly used datasets for practicing machine learning and data analysis.³

It records 50 flowers from each of three iris species: setosa, versicolor, and virginica. For every flower, the petal length, petal width, sepal length, and sepal width were measured. The full dataset therefore holds the species and these four measurements for 150 flowers, and it is laid out as a data frame with 150 rows and 5 columns.

How to Use

Julia

In Julia, the dataset is available through the machine learning datasets package MLDatasets.jl, which must be installed before first use. Because the dataset is loaded as a data frame by default, DataFrames.jl is also required.
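
The one-time installation can be done with Julia's built-in package manager. A minimal sketch, assuming the packages are installed into the currently active environment from the default registry:

```julia
using Pkg

# One-time installation of both packages into the active environment
Pkg.add(["MLDatasets", "DataFrames"])
```

After this, `using MLDatasets` and `using DataFrames` should work in any later session that uses the same environment.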

julia> using MLDatasets
julia> using DataFrames

julia> X = Iris()
dataset Iris:
  metadata   =>    Dict{String, Any} with 4 entries
  features   =>    150×4 DataFrame
  targets    =>    150×1 DataFrame
  dataframe  =>    150×5 DataFrame

julia> X[:]
(features = 150×4 DataFrame
 Row │ sepallength  sepalwidth  petallength  petalwidth
     │ Float64      Float64     Float64      Float64
─────┼──────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2
   2 │         4.9         3.0          1.4         0.2
   3 │         4.7         3.2          1.3         0.2
  ⋮  │      ⋮           ⋮            ⋮           ⋮
 149 │         6.2         3.4          5.4         2.3
 150 │         5.9         3.0          5.1         1.8
                                        145 rows omitted, targets = 150×1 DataFrame
 Row │ class
     │ String15
─────┼────────────────
   1 │ Iris-setosa
   2 │ Iris-setosa
   3 │ Iris-setosa
  ⋮  │       ⋮
 149 │ Iris-virginica
 150 │ Iris-virginica
      145 rows omitted)

Passing the option as_df=false returns the data as plain arrays instead of data frames; indexing with [:] then yields a named tuple of matrices.

julia> X = Iris(as_df=false)[:]
(features = [5.1 4.9 … 6.2 5.9; 3.5 3.0 … 3.4 3.0; 1.4 1.4 … 5.4 5.1; 0.2 0.2 … 2.3 1.8], targets = InlineStrings.String15[InlineStrings.String15("Iris-setosa") InlineStrings.String15("Iris-setosa") … InlineStrings.String15("Iris-virginica") InlineStrings.String15("Iris-virginica")])

julia> typeof(X)
NamedTuple{(:features, :targets), Tuple{Matrix{Float64}, Matrix{InlineStrings.String15}}}
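
The printed output above suggests that, with as_df=false, each flower occupies one column rather than one row. The following sketch makes that layout explicit; the variable names `first_flower` and `labels` are only illustrative, and DataFrames is loaded merely to mirror the session above.

```julia
using MLDatasets, DataFrames

# as_df=false returns plain arrays; [:] gives the (features, targets) named tuple
features, targets = Iris(as_df=false)[:]

size(features)   # (4, 150): 4 measurements per column, one column per flower
size(targets)    # (1, 150): one label per flower

# Measurements of the first flower, in the order
# sepal length, sepal width, petal length, petal width
first_flower = features[:, 1]   # [5.1, 3.5, 1.4, 0.2]

# Flatten the 1×150 label matrix into a vector of 150 labels
labels = vec(targets)
```

Column i of `features` and entry i of `labels` describe the same flower.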

Iris().features, Iris().targets, and Iris().dataframe return the features, the classes, and a data frame combining the two, respectively.

julia> Iris().dataframe
150×5 DataFrame
 Row │ sepallength  sepalwidth  petallength  petalwidth  class
     │ Float64      Float64     Float64      Float64     String15
─────┼──────────────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  Iris-setosa
   2 │         4.9         3.0          1.4         0.2  Iris-setosa
   3 │         4.7         3.2          1.3         0.2  Iris-setosa
  ⋮  │      ⋮           ⋮            ⋮           ⋮             ⋮
 149 │         6.2         3.4          5.4         2.3  Iris-virginica
 150 │         5.9         3.0          5.1         1.8  Iris-virginica
                                                        145 rows omitted
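
As a quick check of the structure described earlier (150 rows, 5 columns, 50 flowers per species), the combined data frame can be grouped by its class column. This is a minimal sketch using standard DataFrames.jl functions; the names `df` and `counts` are only illustrative.

```julia
using MLDatasets, DataFrames

df = Iris().dataframe

size(df)   # (150, 5)

# Number of rows per species; each of the three classes should appear 50 times
counts = combine(groupby(df, :class), nrow => :count)
```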

Environment

OS: Windows 11
Version: Julia v1.8.2, MLDatasets v0.7.6, DataFrames v1.3.6

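¹ 권건우·허령, 야사와 만화로 배우는 인공지능 2 (Learning Artificial Intelligence through Anecdotes and Comics 2), p. 230
² R. A. Fisher, "The use of multiple measurements in taxonomic problems" (1936), in which the data serve as an example of linear discriminant analysis
³ https://archive.ics.uci.edu/ml/index.php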