How to Check DataFrame Size in Julia
Overview
Use nrow(), ncol(), or size() to check the size of a dataframe. Unlike in R, calling length() on a dataframe results in an error.
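As a small sketch of the length() behavior mentioned above, the snippet below wraps the call in try/catch so the script keeps running. The names small_df and length_failed are chosen here for illustration only, and the exact error type may depend on the DataFrames.jl version.

```julia
using DataFrames

# A small illustrative dataframe (name chosen for this example)
small_df = DataFrame(rand(3, 2), :auto)

# length() is not defined for a DataFrame, so the call is expected to throw
length_failed = try
    length(small_df)
    false
catch err
    println("length(small_df) failed with: ", typeof(err))
    true
end

# nrow() and ncol() work as usual
println(nrow(small_df), " rows, ", ncol(small_df), " columns")
```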
Code
julia> using DataFrames
julia> df = DataFrame(rand(100000,5), :auto)
100000×5 DataFrame
    Row │ x1         x2        x3         x4        x5
        │ Float64    Float64   Float64    Float64   Float64
────────┼─────────────────────────────────────────────────────
      1 │ 0.474921   0.942137  0.0523668  0.588696  0.0176242
      2 │ 0.842828   0.910385  0.216194   0.794668  0.664883
      3 │ 0.0350312  0.96542   0.837923   0.920311  0.748409
      4 │ 0.613249   0.731643  0.941826   0.688649  0.161736
   ⋮    │     ⋮         ⋮          ⋮         ⋮          ⋮
  99998 │ 0.767794   0.242687  0.965885   0.557483  0.723849
  99999 │ 0.743936   0.67815   0.529923   0.247698  0.861302
 100000 │ 0.628269   0.252583  0.985485   0.24541   0.942741
                                            99993 rows omitted
df is a dataframe with 100,000 rows and 5 columns.
julia> nrow(df)
100000
julia> ncol(df)
5
julia> size(df)
(100000, 5)
nrow() and ncol() return the number of rows and columns, respectively, and size() returns both as a tuple. Indexing the tuple in (rows, columns) order gives each dimension separately. At first glance, size() seems much more useful, but let's compare their performance.
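The ways of reading the dimensions described above can be sketched side by side. This assumes the two-argument form size(df, dim), which DataFrames.jl provides alongside the tuple form; the dataframe name demo and the other variable names are hypothetical.

```julia
using DataFrames

demo = DataFrame(rand(10, 3), :auto)

# Tuple form: destructure rows and columns in one step
n_rows, n_cols = size(demo)

# Index the tuple, or ask for a single dimension directly
rows_from_tuple = size(demo)[1]
cols_from_dim = size(demo, 2)

# The dedicated functions give the same values
println(n_rows == nrow(demo) == rows_from_tuple)
println(n_cols == ncol(demo) == cols_from_dim)
```

Destructuring is convenient when both dimensions are needed at once, while nrow() and ncol() state the intent directly when only one is needed.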
Optimization
julia> @time for i in 1:10^6
           nrow(df)
       end
0.051730 seconds (1000.00 k allocations: 15.259 MiB)
julia> @time for i in 1:10^6
           size(df)[1]
       end
0.536297 seconds (3.00 M allocations: 61.035 MiB, 5.44% gc time)
The timings above compare nrow() and size(). As expected, the single-purpose nrow() is faster, by roughly a factor of ten in this run. Running each call a million times may seem contrived, but in code that handles big data and queries the dimensions repeatedly, using size() on the assumption that it won't make much difference can waste time unnecessarily.
There is also a big difference in code readability. nrow() and ncol() are function names commonly used in other languages and unambiguously mean the number of rows and columns, whereas size() hurts readability because of the trailing index. Where possible, it is advisable to use nrow() and ncol().
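Because @time at the REPL prompt operates on a non-constant global variable, part of the measured time and allocations may come from global-scope overhead rather than from the functions themselves. A minimal sketch of the same comparison wrapped in functions follows. The helpers count_nrow and count_size are hypothetical names introduced here, and no timings are shown because they depend on the machine and package versions.

```julia
using DataFrames

# Hypothetical helper: add up the row count n times using nrow()
function count_nrow(df, n)
    total = 0
    for _ in 1:n
        total += nrow(df)
    end
    return total
end

# Hypothetical helper: same loop, but using size(df)[1]
function count_size(df, n)
    total = 0
    for _ in 1:n
        total += size(df)[1]
    end
    return total
end

df = DataFrame(rand(100000, 5), :auto)

# Call each helper once first so compilation is not part of the measurement
count_nrow(df, 1)
count_size(df, 1)

t_nrow = @elapsed count_nrow(df, 10^6)
t_size = @elapsed count_size(df, 10^6)
println("nrow: ", t_nrow, " s, size: ", t_size, " s")
```

Accumulating the result into total is intended to keep the loop body from being trivially discarded; for more careful measurements, a dedicated benchmarking package would be the usual choice.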
Full Code
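using DataFrames

# 100,000 rows × 5 columns of random values
df = DataFrame(rand(100000,5), :auto)

nrow(df)  # number of rows
ncol(df)  # number of columns
size(df)  # (rows, columns) as a tuple

# Time one million calls of each approach
@time for i in 1:10^6
    nrow(df)
end

@time for i in 1:10^6
    size(df)[1]
end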
Environment
- OS: Windows
- julia: v1.6.3