Julia's Categorical Array
Overview
The CategoricalArrays.jl package in Julia plays a role similar to factor in R.
Code
julia> using CategoricalArrays
julia> A = ["red", "blue", "red", "green"]
4-element Vector{String}:
"red"
"blue"
"red"
"green"
julia> B = categorical(A)
4-element CategoricalArray{String,1,UInt32}:
"red"
"blue"
"red"
"green"
julia> levels(B)
3-element Vector{String}:
"blue"
"green"
"red"
categorical()
The categorical() function converts a regular array into a categorical array.
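As a rough sketch of what the conversion produces: each element of a categorical array is stored as an integer code that points into the list of levels, and levelcode() exposes that code. The second half assumes that the installed version of CategoricalArrays.jl accepts the levels and ordered keywords of categorical(); the variable names codes, C, and codesC are only illustrative.

```julia
using CategoricalArrays

A = ["red", "blue", "red", "green"]
B = categorical(A)

# Each element is stored as an integer code indexing into levels(B),
# which is sorted by default: "blue" => 1, "green" => 2, "red" => 3.
codes = levelcode.(B)

# Assumption: the `levels` and `ordered` keywords are available.
# They fix the order of the categories explicitly instead of sorting them.
C = categorical(A; levels=["red", "green", "blue"], ordered=true)
codesC = levelcode.(C)
```

With an explicit level order, the same strings map to different codes, which matters when the categories have a natural ordering.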
levels()
The levels() function returns the categories of the array. The categories contain no duplicates, and a category remains in place even when no element of the array belongs to it any longer.
julia> B[2] = "red"; B
4-element CategoricalArray{String,1,UInt32}:
"red"
"red"
"red"
"green"
julia> levels(B)
3-element Vector{String}:
"blue"
"green"
"red"
Keeping the categories regardless of the array's current contents is very useful in practice. It is particularly helpful in data analysis, where one frequently works with subsets of a dataset: a subset still knows the full set of categories of the original data, even when some of them do not occur in it.
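A minimal sketch of the subset case, assuming that indexing with a vector of positions returns a new categorical array carrying the same levels. It also uses droplevels!(), which removes the levels that no element uses, for situations where the unused categories are not wanted. The names S, kept, and dropped are only illustrative.

```julia
using CategoricalArrays

B = categorical(["red", "blue", "red", "green"])

# Take a subset that no longer contains "blue".
S = B[[1, 3, 4]]

# The subset still lists all three categories.
kept = copy(levels(S))

# droplevels! removes the levels that no element of S uses.
droplevels!(S)
dropped = levels(S)
```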
Optimization
Technically, calling unique() on a regular array instead of levels() gives a similar result.
julia> @time for t in 1:10^6
    unique(A)
end
0.543157 seconds (6.00 M allocations: 579.834 MiB, 17.33% gc time)
julia> @time for t in 1:10^6
    levels(B)
end
0.013324 seconds
However, levels() is about 40 times faster here. The categories are stored with the array and updated whenever it changes, so levels() only has to read them back rather than scan the array again.
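Speed aside, the two calls are similar rather than identical. The sketch below reuses the arrays from this post to show the two differences: unique() returns values in order of first appearance and forgets values that no longer occur, while levels() returns the stored categories. The names u1, l1, u2, and l2 are only illustrative.

```julia
using CategoricalArrays

A = ["red", "blue", "red", "green"]
B = categorical(A)

u1 = unique(A)   # order of first appearance: "red", "blue", "green"
l1 = levels(B)   # sorted by default: "blue", "green", "red"

# Overwrite the only "blue" element in both arrays.
A[2] = "red"
B[2] = "red"

u2 = unique(A)   # "blue" no longer appears
l2 = levels(B)   # "blue" is still a category
```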
Full Code
using CategoricalArrays
A = ["red", "blue", "red", "green"]
B = categorical(A)
levels(B)
B[2] = "red"; B
levels(B)
@time for t in 1:10^6
    unique(A)
end
@time for t in 1:10^6
    levels(B)
end
Environment
- OS: Windows
- julia: v1.6.3
- CategoricalArrays v0.10.2