Grouping and Calculating DataFrames in Julia
Overview
Using groupby() to divide a dataframe into groups and combine() to calculate over each group is the way to go [1].
groupby(df, :colname)
- Returns a GroupedDataFrame in which the rows of df are grouped by the values in :colname.
combine(gdf, :colname => fun)
- gdf is a GroupedDataFrame, that is, a dataframe divided into groups.
- :colname => fun is a pair of the symbol :colname, which names the column containing the values to be calculated, and the function fun that performs the calculation.
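The two calls described above can be tried on a very small table. The following is only a minimal sketch: the team and score columns are hypothetical toy data, and mean is taken from the Statistics standard library here rather than from StatsBase.

```julia
using DataFrames
using Statistics  # assumption: mean from the standard library

# Hypothetical toy data, not taken from this article
df = DataFrame(team = ["a", "b", "a", "b"], score = [1, 2, 3, 6])

gdf = groupby(df, :team)               # one group per distinct value of :team
result = combine(gdf, :score => mean)  # apply mean to :score within each group

# The output column is named automatically as <column>_<function>
println(names(result))       # ["team", "score_mean"]
println(result.score_mean)   # [2.0, 4.0]
```

The result has one row per group: the grouping column plus one column per pair passed to combine().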
Code
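using DataFrames
using StatsBase   # for mean

# Example data: one row per member
WJSN = DataFrame(
member = ["다영","다원","루다","소정","수빈","연정","주연","지연","진숙","현정"],
birth = [99,97,97,95,96,99,98,95,99,94],
height = [161,167,157,166,159,165,172,163,162,165],
unit = ["쪼꼬미","메보즈","쪼꼬미","더블랙","쪼꼬미","메보즈","더블랙","더블랙","쪼꼬미","더블랙"]
)
sort!(WJSN, :birth)    # sort the rows in place by :birth
unique(WJSN, :unit)    # keep the first row for each distinct :unit

# Divide the dataframe into groups by :unit
units = groupby(WJSN, :unit)
units[1]
units[2]
units[3]

# Mean of :height within each group
combine(units, :height => mean)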
Let’s run the example code above and check the result.
julia> WJSN
10×4 DataFrame
Row │ member birth height unit
│ String Int64 Int64 String
─────┼───────────────────────────────
1 │ 현정 94 165 더블랙
2 │ 소정 95 166 더블랙
3 │ 지연 95 163 더블랙
4 │ 수빈 96 159 쪼꼬미
5 │ 다원 97 167 메보즈
6 │ 루다 97 157 쪼꼬미
7 │ 주연 98 172 더블랙
8 │ 다영 99 161 쪼꼬미
9 │ 연정 99 165 메보즈
10 │ 진숙 99 162 쪼꼬미
The WJSN dataframe, sorted by :birth, is as shown above.
Dividing by group: groupby()
julia> units = groupby(WJSN, :unit)
GroupedDataFrame with 3 groups based on key: unit
First group (4 rows): unit = "더블랙"
Row │ member birth height unit
│ String Int64 Int64 String
─────┼───────────────────────────────
1 │ 현정 94 165 더블랙
2 │ 소정 95 166 더블랙
3 │ 지연 95 163 더블랙
4 │ 주연 98 172 더블랙
⋮
Last group (2 rows): unit = "메보즈"
Row │ member birth height unit
│ String Int64 Int64 String
─────┼───────────────────────────────
1 │ 다원 97 167 메보즈
2 │ 연정 99 165 메보즈
The dataframe was divided into three groups based on the :unit column.
julia> units[1]
4×4 SubDataFrame
Row │ member birth height unit
│ String Int64 Int64 String
─────┼───────────────────────────────
1 │ 현정 94 165 더블랙
2 │ 소정 95 166 더블랙
3 │ 지연 95 163 더블랙
4 │ 주연 98 172 더블랙
julia> units[2]
4×4 SubDataFrame
Row │ member birth height unit
│ String Int64 Int64 String
─────┼───────────────────────────────
1 │ 수빈 96 159 쪼꼬미
2 │ 루다 97 157 쪼꼬미
3 │ 다영 99 161 쪼꼬미
4 │ 진숙 99 162 쪼꼬미
julia> units[3]
2×4 SubDataFrame
Row │ member birth height unit
│ String Int64 Int64 String
─────┼───────────────────────────────
1 │ 다원 97 167 메보즈
2 │ 연정 99 165 메보즈
By indexing into the GroupedDataFrame as above, we can access each of the divided dataframes.
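Besides positional indexing, a group can also be looked up by its key value, and keys and groups can be iterated together. The sketch below uses hypothetical toy data (the team and score columns are not from this article) and assumes a DataFrames.jl 1.x release.

```julia
using DataFrames

# Hypothetical toy data, not taken from this article
df = DataFrame(team = ["a", "a", "b"], score = [1, 3, 2])
gdf = groupby(df, :team)

println(length(gdf))   # number of groups

# Look a group up by its key value instead of its position
group_b = gdf[(team = "b",)]
println(group_b.score)

# Iterate over each group key together with its SubDataFrame
for (key, sub) in pairs(gdf)
    println(key.team, " has ", nrow(sub), " rows")
end
```

Key-based lookup avoids depending on the position of a group, which is convenient when the order of the groups is not something the code should rely on.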
Calculating by group: combine()
julia> combine(units, :height => mean)
3×2 DataFrame
Row │ unit height_mean
│ String Float64
─────┼─────────────────────
1 │ 더블랙 166.5
2 │ 쪼꼬미 159.75
3 │ 메보즈 166.0
The code above calculates the mean of :height for each group in units, the GroupedDataFrame obtained by grouping the WJSN dataframe by :unit. As described in the overview, the function in the pair, here StatsBase.mean(), is the calculation applied to each group. Changing it to sum() calculates the sum, and changing it to minimum() calculates the minimum value for each group. In this example, the average of :height by :unit was calculated, and the 쪼꼬미 group was found to have the lowest average at 159.75.
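Several calculations can be requested in a single combine() call, and an output column can be given an explicit name with a second `=>`. The following is a sketch on hypothetical toy data (the team and height values are made up for illustration), with mean again taken from the Statistics standard library. Note that the reduction over a whole column is minimum(); min() compares its individual arguments instead.

```julia
using DataFrames
using Statistics  # assumption: mean from the standard library

# Hypothetical toy data, not the WJSN table from this article
df = DataFrame(team = ["a", "a", "b", "b"], height = [160, 170, 150, 158])
gdf = groupby(df, :team)

stats = combine(gdf,
    nrow => :count,                    # number of rows in each group
    :height => sum,                    # auto-named height_sum
    :height => minimum => :shortest,   # explicit output column name
    :height => mean => :avg_height)

println(stats)
```

Each pair contributes one column to the result, so the output here has the columns team, count, height_sum, shortest, and avg_height.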