logo

Zipf's Law 📂Algorithm

Zipf's Law

Laws

Given a corpus, if the relative frequency of the $k$th most common word is denoted as $f_{k}$, then $$ f_{k} = {{C} \over {k}} $$

Explanation

Here, $C$ is a normalization factor that makes $\displaystyle \sum_{k} f_{k} = 1$ possible. If shown as a histogram, the shape is roughly as described above, with the scale adjusted so the total area precisely equals $1$.

LKfZK.png

The thick tail shape that appears on the right is called a long tail. Like Heaps’ law, it is an empirical law that not only fits well but also has the advantage of being succinct.

20180509_152201.png

On the other hand, taking the logarithm of both sides seems to provide a linear relationship as shown by $\log f_{k} = \log C - \log k$. However, be careful because this may not always hold true, especially for words that do not appear frequently, contrary to theoretical expectations.