Heaps' Law 📂Algorithm

Law

Given a corpus with $M$ unique words and $T$ tokens, $M = kT^{b}$.

Explanation

For an English corpus, the constants are typically $10 \le k \le 100$ and about $b = 0.5$. Heaps' law is not derived from a mathematical foundation; it is an empirical observation.
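As a quick illustration of the formula, the following minimal sketch evaluates $M = kT^{b}$ for a few corpus sizes. The function name `heaps_vocabulary` and the values $k = 50$, $b = 0.5$ are assumptions chosen from the typical range quoted above, not measurements from any real corpus.

```python
def heaps_vocabulary(num_tokens, k, b):
    """Predicted number of unique words M = k * T**b (Heaps' law)."""
    return k * num_tokens ** b


# Illustrative constants only: k = 50 and b = 0.5 are assumed values
# inside the typical English range, not fitted parameters.
for num_tokens in (10_000, 1_000_000, 100_000_000):
    predicted = heaps_vocabulary(num_tokens, k=50, b=0.5)
    print(f"T = {num_tokens:>11,}  ->  M is about {predicted:,.0f}")
```

With $b = 0.5$, multiplying the number of tokens by 100 multiplies the predicted vocabulary by only 10, which is the sublinear growth the law describes.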

The formula may seem complex at first glance, but taking the logarithm of both sides gives $\log M = \log k + b \log T$, which shows that $\log M$ is linear in $\log T$. Here $\log k$ is the intercept and $b$ is the slope.
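Because the relationship is linear on a log-log scale, the constants can be estimated with ordinary least squares. The sketch below is one possible way to do this, assuming paired measurements of token counts and vocabulary sizes are available; the function name `fit_heaps` is hypothetical, and the data are synthetic values generated from the law itself with $k = 30$, $b = 0.5$, so the fit simply recovers those numbers.

```python
import math


def fit_heaps(token_counts, vocab_sizes):
    """Estimate (k, b) by least squares on log M = log k + b * log T."""
    xs = [math.log(t) for t in token_counts]
    ys = [math.log(m) for m in vocab_sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                  # slope of the fitted line
    log_k = mean_y - b * mean_x    # intercept of the fitted line
    return math.exp(log_k), b


# Synthetic data (assumed, not from a real corpus): M = 30 * T**0.5
token_counts = [10**3, 10**4, 10**5, 10**6]
vocab_sizes = [30 * t ** 0.5 for t in token_counts]

k_hat, b_hat = fit_heaps(token_counts, vocab_sizes)
print(f"k is about {k_hat:.2f}, b is about {b_hat:.2f}")
```

On a real corpus the points would not fall exactly on a line, so the fitted values would only approximate the underlying trend.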
