Grouping and Calculating DataFrames in Julia
Overview
Use groupby() to split a dataframe into groups and combine() to run a calculation on each group [1].
groupby(df, :colname)
- Returns a GroupedDataFrame based on :colname.
combine(gdf, :colname => fun)
- gdf is a GroupedDataFrame divided by groups. :colname => fun is a pair of the symbol :colname, which names the column containing the values to be calculated, and the calculation function fun.
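To make the two calls concrete, here is a minimal sketch on a small made-up table. The column names :team and :score and their values are assumptions for illustration only, and mean is taken from Julia's Statistics standard library.

```julia
using DataFrames
using Statistics  # standard library; provides mean

# Hypothetical toy data, used only to illustrate the call pattern
df = DataFrame(team = ["a", "b", "a", "b"], score = [1, 2, 3, 4])

# Split df into one group per distinct value of :team
gdf = groupby(df, :team)

# Apply mean to :score within each group; the result has one row per group
# and the new column is named automatically as :score_mean
result = combine(gdf, :score => mean)

# Team "a" has scores 1 and 3 (mean 2.0); team "b" has 2 and 4 (mean 3.0)
means = Dict(result.team .=> result.score_mean)
```

The result is an ordinary DataFrame, so it can be sorted, filtered, or joined like any other.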
Code
using DataFrames
using StatsBase  # makes mean available

WJSN = DataFrame(
    member = ["다영","다원","루다","소정","수빈","연정","주연","지연","진숙","현정"],
    birth = [99,97,97,95,96,99,98,95,99,94],
    height = [161,167,157,166,159,165,172,163,162,165],
    unit = ["쪼꼬미","메보즈","쪼꼬미","더블랙","쪼꼬미","메보즈","더블랙","더블랙","쪼꼬미","더블랙"]
)
sort!(WJSN, :birth)           # sort the rows in place by :birth
unique(WJSN, :unit)           # first row for each distinct :unit

units = groupby(WJSN, :unit)  # split into groups by :unit
units[1]                      # access each group by index
units[2]
units[3]

combine(units, :height => mean)  # mean of :height for each group
Let’s run the example code above and check the result.
julia> WJSN
10×4 DataFrame
 Row │ member  birth  height  unit
     │ String  Int64  Int64   String
─────┼───────────────────────────────
   1 │ 현정       94     165  더블랙
   2 │ 소정       95     166  더블랙
   3 │ 지연       95     163  더블랙
   4 │ 수빈       96     159  쪼꼬미
   5 │ 다원       97     167  메보즈
   6 │ 루다       97     157  쪼꼬미
   7 │ 주연       98     172  더블랙
   8 │ 다영       99     161  쪼꼬미
   9 │ 연정       99     165  메보즈
  10 │ 진숙       99     162  쪼꼬미
The WJSN dataframe is as shown above.
Dividing by group: groupby()
julia> units = groupby(WJSN, :unit)
GroupedDataFrame with 3 groups based on key: unit
First group (4 rows): unit = "더블랙"
 Row │ member  birth  height  unit
     │ String  Int64  Int64   String
─────┼───────────────────────────────
   1 │ 현정       94     165  더블랙
   2 │ 소정       95     166  더블랙
   3 │ 지연       95     163  더블랙
   4 │ 주연       98     172  더블랙
⋮
Last group (2 rows): unit = "메보즈"
 Row │ member  birth  height  unit
     │ String  Int64  Int64   String
─────┼───────────────────────────────
   1 │ 다원       97     167  메보즈
   2 │ 연정       99     165  메보즈
The dataframe was divided into three groups based on the :unit column.
julia> units[1]
4×4 SubDataFrame
 Row │ member  birth  height  unit
     │ String  Int64  Int64   String
─────┼───────────────────────────────
   1 │ 현정       94     165  더블랙
   2 │ 소정       95     166  더블랙
   3 │ 지연       95     163  더블랙
   4 │ 주연       98     172  더블랙
julia> units[2]
4×4 SubDataFrame
 Row │ member  birth  height  unit
     │ String  Int64  Int64   String
─────┼───────────────────────────────
   1 │ 수빈       96     159  쪼꼬미
   2 │ 루다       97     157  쪼꼬미
   3 │ 다영       99     161  쪼꼬미
   4 │ 진숙       99     162  쪼꼬미
julia> units[3]
2×4 SubDataFrame
 Row │ member  birth  height  unit
     │ String  Int64  Int64   String
─────┼───────────────────────────────
   1 │ 다원       97     167  메보즈
   2 │ 연정       99     165  메보즈
By indexing into the GroupedDataFrame as shown above, we can access each of the divided dataframes.
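Besides integer indexing, a GroupedDataFrame can also be iterated over and looked up by group key. The following is a small sketch using the same WJSN data; the variable names sizes, names_by_unit, and black are introduced here for illustration only.

```julia
using DataFrames

WJSN = DataFrame(
    member = ["다영","다원","루다","소정","수빈","연정","주연","지연","진숙","현정"],
    birth = [99,97,97,95,96,99,98,95,99,94],
    height = [161,167,157,166,159,165,172,163,162,165],
    unit = ["쪼꼬미","메보즈","쪼꼬미","더블랙","쪼꼬미","메보즈","더블랙","더블랙","쪼꼬미","더블랙"]
)
sort!(WJSN, :birth)
units = groupby(WJSN, :unit)

# Iterating over a GroupedDataFrame yields one SubDataFrame per group
sizes = [nrow(group) for group in units]

# pairs() yields (key, group) pairs; key.unit is the grouping value of that group
names_by_unit = Dict(key.unit => collect(group.member) for (key, group) in pairs(units))

# A group can also be looked up by its key value, passed as a tuple
black = units[("더블랙",)]
```

Looking groups up by key avoids relying on the position of a group within the GroupedDataFrame.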
Calculating by group: combine()
julia> combine(units, :height => mean)
3×2 DataFrame
 Row │ unit    height_mean
     │ String  Float64
─────┼─────────────────────
   1 │ 더블랙       166.5
   2 │ 쪼꼬미       159.75
   3 │ 메보즈       166.0
The code above calculates the mean of :height for each group in units, the GroupedDataFrame obtained by grouping the WJSN dataframe by :unit. The function StatsBase.mean() plays the role of fun from the overview and calculates the average. Changing it to sum() calculates the sum for each group, and changing it to minimum() calculates the minimum value for each group; note that min() compares separate arguments rather than the elements of a column, so minimum() is the one to use here. In this example, the average of :height by :unit was calculated, and the 쪼꼬미 group was found to have the lowest average at 159.75.
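As a sketch of swapping the calculation function, several column => function pairs can be passed to a single combine() call. The output name :avg_height is a hypothetical name chosen here to show how a result column can be renamed, and mean is loaded from the Statistics standard library instead of StatsBase.

```julia
using DataFrames
using Statistics  # standard library; provides mean

WJSN = DataFrame(
    height = [161,167,157,166,159,165,172,163,162,165],
    unit = ["쪼꼬미","메보즈","쪼꼬미","더블랙","쪼꼬미","메보즈","더블랙","더블랙","쪼꼬미","더블랙"]
)
units = groupby(WJSN, :unit)

# One output column per pair. Columns are named :height_sum and
# :height_minimum automatically; the third pair sets its own name.
stats = combine(units,
    :height => sum,
    :height => minimum,
    :height => mean => :avg_height)

# Look results up by unit rather than by row position
sums = Dict(stats.unit .=> stats.height_sum)
mins = Dict(stats.unit .=> stats.height_minimum)
avgs = Dict(stats.unit .=> stats.avg_height)
```

Computing all statistics in one call keeps them in a single dataframe with one row per group.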
