Zipf's Law
Laws
Given a corpus, if the relative frequency of the th most common word is denoted as , then
Explanation
Here, is a normalization factor that makes possible. If shown as a histogram, the shape is roughly as described above, with the scale adjusted so the total area precisely equals .
The thick tail shape that appears on the right is called a long tail. Like Heaps’ law, it is an empirical law that not only fits well but also has the advantage of being succinct.
On the other hand, taking the logarithm of both sides seems to provide a linear relationship as shown by . However, be careful because this may not always hold true, especially for words that do not appear frequently, contrary to theoretical expectations.