How to Check DataFrame Size in Julia

Overview

Use nrow(), ncol(), or size(). Unlike in R, where length() returns the number of columns of a data frame, calling length() on a DataFrame results in an error.
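As a quick check of the length() behavior described here, the following minimal sketch wraps the call in try/catch so the script keeps running. The small 10×5 table is made up for illustration, and the exact exception type may depend on the DataFrames.jl version, so it is printed rather than assumed.

```julia
using DataFrames

df = DataFrame(rand(10, 5), :auto)  # small illustrative table, not the one used below

# Guard the call so the script continues if length() is rejected.
length_failed = try
    length(df)
    false
catch err
    println("length(df) failed with: ", typeof(err))
    true
end

# The supported ways to ask for the dimensions.
println((nrow(df), ncol(df)))
```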

Code

julia> using DataFrames

julia> df = DataFrame(rand(100000,5), :auto)
100000×5 DataFrame
    Row │ x1         x2        x3         x4        x5        
        │ Float64    Float64   Float64    Float64   Float64
────────┼─────────────────────────────────────────────────────
      1 │ 0.474921   0.942137  0.0523668  0.588696  0.0176242
      2 │ 0.842828   0.910385  0.216194   0.794668  0.664883
      3 │ 0.0350312  0.96542   0.837923   0.920311  0.748409
      4 │ 0.613249   0.731643  0.941826   0.688649  0.161736
   ⋮    │     ⋮         ⋮          ⋮         ⋮          ⋮
  99998 │ 0.767794   0.242687  0.965885   0.557483  0.723849
  99999 │ 0.743936   0.67815   0.529923   0.247698  0.861302
 100000 │ 0.628269   0.252583  0.985485   0.24541   0.942741
                                            99993 rows omitted

df is a DataFrame with 100,000 rows and 5 columns.

julia> nrow(df)
100000

julia> ncol(df)
5

julia> size(df)
(100000, 5)

nrow() and ncol() return the number of rows and columns, respectively, while size() returns both at once as a tuple in the order (rows, columns). Indexing that tuple, as in size(df)[1] or size(df)[2], gives each dimension separately. At first glance, size() seems the more versatile choice, but let’s compare their performance.
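Besides indexing the tuple, size() also accepts the dimension as a second argument, and the tuple can be destructured directly. A minimal sketch, using a smaller 100×5 table chosen only for illustration:

```julia
using DataFrames

df = DataFrame(rand(100, 5), :auto)

# Pass the dimension as a second argument: 1 for rows, 2 for columns.
rows = size(df, 1)
cols = size(df, 2)

# Or destructure the (rows, columns) tuple in one step.
nr, nc = size(df)

println((rows, cols))  # (100, 5)
println((nr, nc))      # (100, 5)
```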

Optimization

julia> @time for i in 1:10^6
           nrow(df)
       end
  0.051730 seconds (1000.00 k allocations: 15.259 MiB)

julia> @time for i in 1:10^6
           size(df)[1]
       end
  0.536297 seconds (3.00 M allocations: 61.035 MiB, 5.44% gc time)

Above is a speed comparison of nrow() and size(). In this measurement, nrow() is roughly ten times faster than size(df)[1], which also has to build a tuple and index into it. The test is admittedly artificial, since it repeats the call a million times, but in code that queries the dimensions over and over, for example while processing big data, using size() on the assumption that it won’t make much difference can waste time.
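The loops above run at global scope on a non-constant global variable, which in Julia adds overhead of its own to every iteration. One way to reduce that effect is to move each loop into a function and call it once beforehand so that compilation is not timed. The sketch below does this with the standard @elapsed macro; the helper names sum_nrow and sum_size are made up for this example, and no timings are shown because they depend on the machine and package versions.

```julia
using DataFrames

df = DataFrame(rand(100_000, 5), :auto)

# Hypothetical helpers: each adds up the row count n times so that
# the loop produces a result the caller can inspect.
function sum_nrow(df, n)
    total = 0
    for _ in 1:n
        total += nrow(df)
    end
    return total
end

function sum_size(df, n)
    total = 0
    for _ in 1:n
        total += size(df)[1]
    end
    return total
end

# Warm-up calls so that compilation time is excluded from the measurement.
sum_nrow(df, 1)
sum_size(df, 1)

# @elapsed returns the elapsed time in seconds.
t_nrow = @elapsed sum_nrow(df, 10^6)
t_size = @elapsed sum_size(df, 10^6)

println("nrow: ", t_nrow, " s")
println("size: ", t_size, " s")
```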

Readability differs considerably as well. nrow() and ncol() are names familiar from other languages and unambiguously mean the number of rows and columns, whereas size() needs a trailing index that makes the code harder to read. Where possible, prefer nrow() and ncol().
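To make the readability point concrete, here is a small side-by-side sketch. The table and the variable names dims_via_size and dims_via_names are invented for this example.

```julia
using DataFrames

df = DataFrame(id = 1:3, score = [0.5, 0.8, 0.1])

# With size(), the reader has to remember which index means rows and which means columns.
dims_via_size = "$(size(df)[1]) rows, $(size(df)[2]) columns"

# With nrow() and ncol(), the function names state the intent.
dims_via_names = "$(nrow(df)) rows, $(ncol(df)) columns"

println(dims_via_size)   # 3 rows, 2 columns
println(dims_via_names)  # 3 rows, 2 columns
```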

Full Code

using DataFrames

df = DataFrame(rand(100000,5), :auto)

nrow(df)
ncol(df)
size(df)

@time for i in 1:10^6
    nrow(df)
end

@time for i in 1:10^6
    size(df)[1]
end

Environment

  • OS: Windows
  • julia: v1.6.3