logo

Zipf's Law 📂Algorithm

Zipf's Law

Laws

Given a corpus, if the relative frequency of the kkth most common word is denoted as fkf_{k}, then fk=Ck f_{k} = {{C} \over {k}}

Explanation

Here, CC is a normalization factor that makes kfk=1\displaystyle \sum_{k} f_{k} = 1 possible. If shown as a histogram, the shape is roughly as described above, with the scale adjusted so the total area precisely equals 11.

LKfZK.png

The thick tail shape that appears on the right is called a long tail. Like Heaps’ law, it is an empirical law that not only fits well but also has the advantage of being succinct.

20180509_152201.png

On the other hand, taking the logarithm of both sides seems to provide a linear relationship as shown by logfk=logClogk\log f_{k} = \log C - \log k. However, be careful because this may not always hold true, especially for words that do not appear frequently, contrary to theoretical expectations.