Grouping DataFrames and Calculating by Group in Julia


To calculate a value for each group, split the dataframe with groupby() and then apply a function to each group with combine() [1].

groupby(df, :colname)

  • Returns a GroupedDataFrame, split into groups by the values in :colname.

combine(gdf, :colname => fun)

  • gdf is a GroupedDataFrame, that is, a dataframe already divided into groups.
  • :colname => fun is a pair: the symbol :colname names the column holding the values to be calculated, and fun is the function applied to that column within each group.
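As a minimal sketch of the pair syntax, the snippet below uses a small made-up table (the names toy, grp, and val are hypothetical and are not part of the example that follows). It assumes mean comes from the Statistics standard library. It also shows the optional third element of the pair, which sets the name of the output column.

```julia
using DataFrames
using Statistics  # provides mean

# Hypothetical toy data, used only to illustrate the syntax
toy = DataFrame(grp = ["a", "a", "b"], val = [1, 3, 10])

gdf = groupby(toy, :grp)

# source column => function; the result column is auto-named val_mean
auto_named = combine(gdf, :val => mean)

# source column => function => target column name
renamed = combine(gdf, :val => mean => :avg)
```

Without a target name, the output column is named by joining the source column and the function name with an underscore.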


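using DataFrames
using StatsBase  # makes mean() available

# Sample data: one row per member
WJSN = DataFrame(
    member = ["다영","다원","루다","소정","수빈","연정","주연","지연","진숙","현정"],
    birth = [99,97,97,95,96,99,98,95,99,94],
    height = [161,167,157,166,159,165,172,163,162,165],
    unit = ["쪼꼬미","메보즈","쪼꼬미","더블랙","쪼꼬미","메보즈","더블랙","더블랙","쪼꼬미","더블랙"]
)
sort!(WJSN, :birth)  # sort rows in place by birth year

unique(WJSN, :unit)  # keep the first row of each distinct unit

units = groupby(WJSN, :unit)  # split into groups by unit

combine(units, :height => mean)  # mean height per group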

Let’s run the example code above and check the result.

julia> WJSN
10×4 DataFrame
 Row │ member  birth  height  unit   
     │ String  Int64  Int64   String 
   1 │ 현정       94     165  더블랙
   2 │ 소정       95     166  더블랙
   3 │ 지연       95     163  더블랙
   4 │ 수빈       96     159  쪼꼬미
   5 │ 다원       97     167  메보즈
   6 │ 루다       97     157  쪼꼬미
   7 │ 주연       98     172  더블랙
   8 │ 다영       99     161  쪼꼬미
   9 │ 연정       99     165  메보즈
  10 │ 진숙       99     162  쪼꼬미

The WJSN dataframe is as shown above.

Dividing by group: groupby()

julia> units = groupby(WJSN, :unit)
GroupedDataFrame with 3 groups based on key: unit
First group (4 rows): unit = "더블랙"
 Row │ member  birth  height  unit   
     │ String  Int64  Int64   String 
   1 │ 현정       94     165  더블랙
   2 │ 소정       95     166  더블랙
   3 │ 지연       95     163  더블랙
   4 │ 주연       98     172  더블랙
Last group (2 rows): unit = "메보즈"
 Row │ member  birth  height  unit   
     │ String  Int64  Int64   String 
   1 │ 다원       97     167  메보즈
   2 │ 연정       99     165  메보즈

The dataframe was divided into three groups based on the :unit column.

julia> units[1]
4×4 SubDataFrame
 Row │ member  birth  height  unit   
     │ String  Int64  Int64   String 
   1 │ 현정       94     165  더블랙
   2 │ 소정       95     166  더블랙
   3 │ 지연       95     163  더블랙
   4 │ 주연       98     172  더블랙

julia> units[2]
4×4 SubDataFrame
 Row │ member  birth  height  unit   
     │ String  Int64  Int64   String 
   1 │ 수빈       96     159  쪼꼬미
   2 │ 루다       97     157  쪼꼬미
   3 │ 다영       99     161  쪼꼬미
   4 │ 진숙       99     162  쪼꼬미

julia> units[3]
2×4 SubDataFrame
 Row │ member  birth  height  unit   
     │ String  Int64  Int64   String 
   1 │ 다원       97     167  메보즈
   2 │ 연정       99     165  메보즈

Indexing into the GroupedDataFrame as above gives access to each group as a SubDataFrame.
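Besides positional indexing, a group can also be looked up by its key value, and all groups can be visited in a loop. The following is a minimal sketch of both, rebuilding the same WJSN data so that it runs on its own; the variable names black and sizes are introduced here only for illustration.

```julia
using DataFrames

WJSN = DataFrame(
    member = ["다영","다원","루다","소정","수빈","연정","주연","지연","진숙","현정"],
    birth = [99,97,97,95,96,99,98,95,99,94],
    height = [161,167,157,166,159,165,172,163,162,165],
    unit = ["쪼꼬미","메보즈","쪼꼬미","더블랙","쪼꼬미","메보즈","더블랙","더블랙","쪼꼬미","더블랙"]
)
units = groupby(WJSN, :unit)

# Look up one group by its key value instead of its position
black = units[(unit = "더블랙",)]

# Iterate over (key, group) pairs and record the size of each group
sizes = Dict(key.unit => nrow(sdf) for (key, sdf) in pairs(units))
```

Key-based lookup does not depend on the order in which the groups happen to be stored, which makes it a safer choice than units[1] when the group order is not guaranteed.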

Calculating by group: combine()

julia> combine(units, :height => mean)
3×2 DataFrame
 Row │ unit    height_mean 
     │ String  Float64     
   1 │ 더블랙       166.5
   2 │ 쪼꼬미       159.75
   3 │ 메보즈       166.0

The code above calculates the mean of :height for each group in units, the GroupedDataFrame obtained by grouping the WJSN dataframe by :unit. Here mean(), made available by using StatsBase, is the function that calculates the average. Replacing it with sum() calculates the sum for each group, and replacing it with minimum() calculates the minimum for each group (min() compares separate arguments and does not reduce a vector). In this example, the average of :height was calculated by :unit, and the 쪼꼬미 group has the lowest average at 159.75.
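Several calculations can be requested in a single combine() call by passing more than one pair. The sketch below rebuilds the same data so that it runs on its own, takes mean from the Statistics standard library, and adds nrow to count the rows in each group; the variable name stats is introduced here only for illustration.

```julia
using DataFrames
using Statistics  # provides mean

WJSN = DataFrame(
    member = ["다영","다원","루다","소정","수빈","연정","주연","지연","진숙","현정"],
    birth = [99,97,97,95,96,99,98,95,99,94],
    height = [161,167,157,166,159,165,172,163,162,165],
    unit = ["쪼꼬미","메보즈","쪼꼬미","더블랙","쪼꼬미","메보즈","더블랙","더블랙","쪼꼬미","더블랙"]
)
units = groupby(WJSN, :unit)

# One output column per pair, plus a row count per group
stats = combine(units,
    :height => mean,
    :height => sum,
    :height => minimum,
    nrow,
)
```

The result has one row per unit and the columns height_mean, height_sum, height_minimum, and nrow.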