Shannon Entropy: Entropy Defined by Random Variables
Overview
Shannon entropy, or information entropy, is a measure of disorder defined for a random variable; it can be viewed as quantifying how much uncertainty a probability distribution carries.
Easy and Complex Definitions
Discrete Entropy
When the probability mass function of a discrete random variable $X$ is $p(x)$, the entropy of $X$ is defined as follows. $$ H(X) := - \sum_{x} p(x) \log_{2} p(x) $$
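As a quick illustration, here is a minimal sketch in Python (assuming NumPy is available; the helper name `discrete_entropy` and the example distribution are just for illustration) that evaluates this sum for a probability mass function given as an array.

```python
import numpy as np

def discrete_entropy(pmf):
    """H(X) = -sum p(x) log2 p(x); terms with p(x) = 0 contribute 0."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]                       # treat 0 * log 0 as 0
    return -np.sum(p * np.log2(p))

# A fair six-sided die: entropy should be log2(6) ≈ 2.585 bits
print(discrete_entropy([1/6] * 6))
```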
Continuous Entropy
When the probability density function of a continuous random variable $X$ is $f(x)$, the entropy of $X$ is defined as follows. $$ H(X) := - \int_{-\infty}^{\infty} f(x) \log_{2} f(x) dx $$
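The integral can likewise be approximated numerically. The sketch below (assuming SciPy; the exponential density and the name `continuous_entropy` are illustrative choices, not part of the definition) integrates $- f(x) \log_{2} f(x)$ by quadrature and compares the result with SciPy's own `entropy()`, converted from nats to bits.

```python
import numpy as np
from scipy import integrate, stats

def continuous_entropy(pdf, a, b):
    """Numerically approximate H(X) = -∫ f(x) log2 f(x) dx over [a, b]."""
    def integrand(x):
        fx = pdf(x)
        return -fx * np.log2(fx) if fx > 0 else 0.0
    value, _ = integrate.quad(integrand, a, b)
    return value

# Exponential density with rate 1, supported on [0, ∞)
dist = stats.expon(scale=1.0)
print(continuous_entropy(dist.pdf, 0, np.inf))   # ≈ 1.4427 bits
print(dist.entropy() / np.log(2))                # same value, nats → bits
```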
Hard yet Simple Definitions
The expected value $H(X)$ of the Shannon information $I(X)$ for the random variable $X$ is called entropy. $$ H(X) := E(I(X)) $$
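To see the "expected value of information" reading concretely, the hypothetical sketch below takes the Shannon information of an outcome $x$ to be $-\log_{2} p(x)$, draws samples from a small pmf, and averages their information; the Monte Carlo mean approaches the exact sum $-\sum_{x} p(x) \log_{2} p(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

outcomes = np.array([0, 1, 2])
p = np.array([0.5, 0.25, 0.25])           # pmf of X

# Exact entropy: the weighted sum -Σ p(x) log2 p(x)
exact = -np.sum(p * np.log2(p))           # = 1.5 bits

# Monte Carlo: average the information -log2 p(x) over samples of X
samples = rng.choice(outcomes, size=100_000, p=p)
information = -np.log2(p[samples])
print(exact, information.mean())          # the two values should be close
```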
Explanation
When the probability mass functions of random variables $X$ and $Y$ are $p$ and $q$, respectively, the entropy is also written as follows. $$ H(X) = H(p) \\ H(Y) = H(q) $$
Entropy is a concept widely used across various fields of science, and no matter how it is defined, its abstract meaning is generally ‘the degree of disorder’. While it might seem unrelated to entropy in thermodynamics at first glance, its form is strikingly similar to Gibbs’ entropy formula $$ S = - k_{B} \sum_{i} P_{i} \ln P_{i} $$ and historically there is a deep connection. An anecdote says that when Claude Shannon first discovered $H(X)$ and realized its importance, he consulted John von Neumann on what to name it, and von Neumann replied:
For two reasons, $H$ should be called entropy. First, that function has already been called entropy in thermodynamics. Second, most people don’t really know what entropy is, so if you use the word ‘entropy’ in any argument, you will win.
Disorder
Let’s see how entropy, the expected value of information, naturally represents disorder.
Consider a Bernoulli distribution with probability $p$. For example, imagine tossing a coin rigged so that the probability of heads is $p \in (0,1)$. The entropy of the random variable $X$ representing the coin’s heads or tails is calculated exactly as follows. $$ H(X) = - p \log_{2} p - (1-p) \log_{2} (1-p) $$ The closer $p$ is to $0$ or $1$, the more the uncertainty decreases and the less disorder there is. If you were playing a game of guessing heads or tails with a coin whose probability of heads is $90\% $, you could simply keep favoring heads without ever needing to pick tails. Let’s actually calculate and see whether this matches the intuition. If $p = 1/4$, then $$ \begin{align*} H(X) =& - {{ 1 } \over { 4 }} \log_{2} {{ 1 } \over { 4 }} - {{ 3 } \over { 4 }} \log_{2} {{ 3 } \over { 4 }} \\ =& {{ 1 } \over { 4 }} \log_{2} 4 - {{ 3 } \over { 4 }} \left( \log_{2} 3 - \log_{2} 4 \right) \\ =& {{ 1 } \over { 2 }} - {{ 3 } \over { 4 }} \log_{2} 3 + {{ 3 } \over { 2 }} \\ =& 2 - {{ 3 } \over { 4 }} \log_{2} 3 \end{align*} $$ which numerically is about $0.81$. Now calculating for $p = 1/2$, $$ \begin{align*} H(X) =& - {{ 1 } \over { 2 }} \log_{2} {{ 1 } \over { 2 }} - {{ 1 } \over { 2 }} \log_{2} {{ 1 } \over { 2 }} \\ =& {{ 1 } \over { 2 }} + {{ 1 } \over { 2 }} \\ =& 1 \end{align*} $$ shows that the entropy is higher than when $p=1/4$. Indeed, this is the most chaotic and disordered state, in which it is completely unknowable whether heads or tails will come up.
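The arithmetic above is easy to double-check numerically; a minimal sketch (with an illustrative `binary_entropy` helper) reproduces roughly $0.81$ at $p = 1/4$ and exactly $1$ at $p = 1/2$.

```python
import numpy as np

def binary_entropy(p):
    """Entropy of a Bernoulli(p) random variable in bits, for p in (0, 1)."""
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

print(binary_entropy(1/4))   # ≈ 0.8113 bits
print(binary_entropy(1/2))   # = 1.0 bit, the maximum
```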
As another example, consider a random variable $X$ following a uniform distribution $\text{Uni}(a,b)$; its entropy can be calculated simply as $$ \begin{align*} H(X) =& - \int_{a}^{b} {{ 1 } \over { b-a }} \log_{2} {{ 1 } \over { b-a }} dx \\ =& \log_{2} \left( b-a \right) \end{align*} $$ We said that entropy is a measure of disorder, and widening the interval from $a$ to $b$ makes it increasingly difficult to guess closely what value $X$ will take, so $\log_{2} (b-a)$ increases as well. This again shows how naturally entropy serves as a measure of disorder.
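As a sanity check on the claim that widening the interval increases the entropy, the sketch below (SciPy assumed; the chosen widths and names are illustrative) tabulates $\log_{2}(b-a)$ for a few interval widths and, for one case, compares it with direct numerical integration of the definition.

```python
import numpy as np
from scipy import integrate

# log2(b - a) for Uni(a, b): entropy grows as the interval widens
for width in [1, 2, 4, 8]:
    print(width, np.log2(width))

# Direct check of the integral for Uni(0, 4): should give 2 bits
a, b = 0, 4
f = 1 / (b - a)                          # constant density on [a, b]
value, _ = integrate.quad(lambda x: -f * np.log2(f), a, b)
print(value)                             # ≈ 2.0
```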
Limits of the Easy Definition
Anyone reasonably educated should see no real difference between the easy and hard definitions; the latter is simply more general and covers everything the former states. The reference cited defines discrete entropy only for finitely many events, and while it does a good job defining continuous entropy, it points out a problem that arises when the continuous case is approached as a limit of the discrete one.
Speaking abstractly, Shannon information is a random variable that maps each event of the original probability distribution to its amount of information, so entropy does not necessarily have to be defined separately for the discrete and continuous cases, and there is no need to fuss over finite versus infinite, countable versus uncountable, ranges of integration, and so on. Unless there is a problem with the definition of information itself, the definition of entropy can simply be arrived at as ‘the expected value of information’.