Pythagorean Winning Percentage

Formula ¹

For a team in a given sports league, let $S$ be the total points (runs) the team scores and $A$ the total points it allows over a season. The team's expected winning percentage $p$ for that season is as follows. $p = {{ S^{2} } \over { S^{2} + A^{2} }} = {{ 1 } \over { 1 + (A/S)^{2} }}$
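As a quick check of the formula, here is a minimal R sketch. The function name `pythagorean_expectation` and the 800/700 team are hypothetical, chosen only for illustration.

```r
# Pythagorean expectation with the classic exponent 2.
# S: points (runs) scored, A: points (runs) allowed; both may be vectors.
pythagorean_expectation <- function(S, A) {
  stopifnot(all(S > 0), all(A > 0))
  1 / (1 + (A / S)^2)
}

# Hypothetical team: 800 runs scored, 700 runs allowed.
p <- pythagorean_expectation(800, 700)
round(p, 3)  # 0.566

# Both forms of the formula agree.
all.equal(p, 800^2 / (800^2 + 700^2))
```

Multiplying `p` by the number of games in a season turns the expected percentage into an expected win total.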

Explanation

The Pythagorean expectation, proposed by Bill James, is a nonlinear model that explains a team's season winning percentage using its points scored and points allowed as independent variables. It is self-evident that scoring more leads to more wins and allowing more leads to more losses; quantifying that relationship, however, is an entirely different matter.

The scatter plot for this section (reproducible with the R code in the Code section) compares each team's scoring ratio with its winning percentage from 1982 to 2020 in a certain baseball league. The points trace a curve that deviates slightly from a straight line. Bill James proposed an intuitive formula for this pattern, and it describes the actual data remarkably well. Steven Miller later supplied a statistical derivation of the formula.
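One way to see where the slight curvature comes from is to write the formula in terms of the scoring ratio $r = S/A$, so that $p = r^{2} / (1 + r^{2})$, and compare it with its tangent line at $r = 1$, which is $p \approx 0.5 + 0.5 (r - 1)$. The sketch below is illustrative arithmetic only and is not fitted to the data in the scatter plot; the function names are hypothetical.

```r
# Expected winning percentage as a function of the scoring ratio r = S / A.
pyth_from_ratio <- function(r) r^2 / (1 + r^2)

# Tangent line to that curve at r = 1 (the slope there is 2 / 4 = 0.5).
tangent_at_one <- function(r) 0.5 + 0.5 * (r - 1)

r <- c(0.6, 0.8, 1.0, 1.2, 1.4)
comparison <- data.frame(
  ratio   = r,
  curve   = round(pyth_from_ratio(r), 3),
  tangent = tangent_at_one(r)
)
print(comparison)
```

Near $r = 1$ the two columns almost coincide, and the curve falls a little below the line as the ratio moves away from 1 in either direction.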

Statistical Derivation

The name Pythagorean winning percentage comes from the denominator $S^{2} + A^{2}$ , which is reminiscent of the Pythagorean theorem. The exponent need not be $2$ , however: the formula generalizes to any positive exponent $\gamma$ , and for Major League Baseball data since 1954, $\gamma \approx 1.85$ has been found to fit best. $p_{\gamma} = {{ S^{\gamma} } \over { S^{\gamma} + A^{\gamma} }}$ Beyond being a mathematical generalization, this form can be adapted to other sports under suitable assumptions. For the NBA (basketball), a much larger value of $14 < \gamma < 17$ has been suggested, and for the NFL (American football), $\gamma \approx 2.4$ .
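Taking log-odds of the generalized formula gives $\log \left( p_{\gamma} / (1 - p_{\gamma}) \right) = \gamma \log (S/A)$ , so $\gamma$ can be estimated as the slope of a linear regression without an intercept. This is one common way to fit the exponent and not necessarily how the values quoted above were obtained. The sketch below uses noise-free synthetic values generated from the formula itself with $\gamma = 1.85$ , purely to check the algebra; the function name and the numbers are hypothetical.

```r
# Generalized Pythagorean expectation with exponent gamma.
pythagorean_gamma <- function(S, A, gamma = 2) {
  S^gamma / (S^gamma + A^gamma)
}

# Synthetic (not real) seasons: A fixed at 700, scoring ratios from 0.70 to 1.40.
S <- 700 * seq(0.70, 1.40, by = 0.05)
A <- rep(700, length(S))
p <- pythagorean_gamma(S, A, gamma = 1.85)

# log(p / (1 - p)) = gamma * log(S / A): fit the slope with no intercept.
fit <- lm(qlogis(p) ~ 0 + log(S / A))
gamma_hat <- unname(coef(fit))
gamma_hat  # recovers 1.85 up to floating-point error
```

With real data, `S`, `A`, and `p` would be replaced by season totals and observed winning percentages, and the estimate would carry sampling noise.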

For a specific derivation, refer to the post summarizing Steven Miller’s paper².

Code

The following R code reproduces the scatter plot described in the Explanation section.

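library(ggplot2)

# Base URL of the post that hosts the two CSV files.
post_url <- "https://freshrimpsushi.github.io/posts/2217/"

# Team pitching and team batting records for 1982-2020; the first column is dropped.
team_pitch <- read.csv(paste0(post_url, "팀투구82_20.csv"), header = TRUE, encoding = 'UTF-8')[,-1]
team_hit <- read.csv(paste0(post_url, "팀타격82_20.csv"), header = TRUE, encoding = 'UTF-8')[,-1]

data <- data.frame(
    # 팀승률 (team winning percentage): wins (승) / starts (선발)
    팀승률 = team_pitch$승 / team_pitch$선발,
    # 득실점비 (scoring ratio): runs scored (득점) / runs allowed (실점)
    득실점비 = team_hit$득점 / team_pitch$실점
)

# Scatter plot of scoring ratio (x) against winning percentage (y).
ggplot(data, aes(x = 득실점비, y = 팀승률)) +
    geom_point(alpha = 0.5, shape = 16) +
    theme_bw() + coord_fixed(ratio = 2)
ggsave("득점비율vs팀승률.png", width = 480, height = 480, units = "px", dpi = 120)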

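¹ Baumer, trans. Song Min-gu (송민구). (2015). The Sabermetric Revolution (세이버메트릭스 레볼루션): p. 100
² The reference explains that it was "shown that this result can arise because a team's runs scored and runs allowed are independent and both have a shape similar to the normal distribution." In fact, the derivation assumes a Weibull distribution, and the Weibull distribution does not particularly resemble the normal distribution.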