Heaps' Law

Law

Let $M$ be the number of unique words (the vocabulary size) and $T$ the number of tokens in a corpus. Then, for constants $k$ and $b$, $$ M = kT^{b} $$
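To make the two quantities concrete, here is a minimal sketch that counts $T$ and $M$ while reading a text. The function name `vocabulary_growth`, the sample sentence, and the whitespace tokenization are all assumptions made for illustration; a real corpus would need a proper tokenizer.

```python
def vocabulary_growth(tokens):
    """Return (T, M) pairs: after reading T tokens, M distinct words have been seen."""
    seen = set()
    curve = []
    for t, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((t, len(seen)))
    return curve


# Toy example: lowercase the text and split on whitespace (a naive tokenizer).
text = "the cat sat on the mat and the dog sat on the log"
tokens = text.lower().split()

curve = vocabulary_growth(tokens)
T, M = curve[-1]
print(T, M)  # 13 tokens, 8 unique words
```

The list `curve` holds the $(T, M)$ pairs that Heaps' law describes; on a corpus this small the power-law shape is not yet visible.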

Explanation

For English corpora, the constants typically fall in the range $10 \le k \le 100$, with $b \approx 0.5$. Heaps' law is not derived from a mathematical foundation; it was obtained empirically.
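As a rough illustration, assume $k = 30$ (an arbitrary value within the range above, not a measured one) and $b = 0.5$. A corpus of $T = 10^{6}$ tokens would then be expected to contain about $$ M = 30 \cdot (10^{6})^{0.5} = 30 \cdot 10^{3} = 30{,}000 $$ unique words. Because $b = 0.5$, making the corpus 100 times larger makes the predicted vocabulary only 10 times larger.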

The formula may look complicated at first glance, but taking the logarithm of both sides gives $\log M = \log k + b \log T$, which is a linear relationship between $\log M$ and $\log T$. Here $\log k$ is the intercept and $b$ is the slope.
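Because the log-transformed relationship is linear, $k$ and $b$ can be estimated by ordinary least squares on $(\log T, \log M)$ pairs. The following is a minimal sketch using only the standard library; the function name `fit_heaps` is hypothetical, and the data are synthetic values generated from assumed constants $k = 30$, $b = 0.5$, not measurements from a real corpus.

```python
import math


def fit_heaps(ts, ms):
    """Least-squares fit of log M = log k + b log T; returns (k, b)."""
    xs = [math.log(t) for t in ts]
    ys = [math.log(m) for m in ms]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx                          # slope of the fitted line
    k = math.exp(y_mean - b * x_mean)      # the intercept is log k, so exponentiate
    return k, b


# Synthetic (T, M) pairs generated from assumed constants k = 30, b = 0.5.
ts = [10 ** e for e in range(2, 8)]
ms = [30 * t ** 0.5 for t in ts]

k_hat, b_hat = fit_heaps(ts, ms)
print(round(k_hat, 3), round(b_hat, 3))
```

On noise-free synthetic data the fit recovers the assumed constants up to floating-point error. On a real corpus the points scatter around the line, and the estimates depend on the tokenizer and on which range of $T$ is used.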
