Shannon Information: Information Defined by Probability Theory

Buildup

Card Matching Game

Imagine Alice and Bob betting on which of the 52 cards in a deck (Jokers excluded) has been drawn facedown.

Alice: The card I drew is not a Joker.

Bob frowns the moment he hears this. The statement is technically true, but it is so obvious that it offers nothing useful for the bet. Before going any further, the two need to settle the stakes. They loosely agree that a player who has narrowed the card down to $r$ candidates takes a share $1 - r/52$ of the pot; the denominator is $52$ simply because there are 52 cards in total. Alice has just ‘narrowed’ the possibilities from 52 all the way down to $r = 52$, so she earns nothing this time, whereas naming the exact card ($r = 1$) would win almost the entire pot. We’ll think about the optimal strategy for this game later; for now, let’s focus on the cards.
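Before moving on, the payoff rule itself is worth pinning down numerically. Below is a quick sketch in plain Python (the name `payout_share` is just for illustration) that evaluates $1 - r/52$ for a few values of $r$:

```python
def payout_share(r: int, deck_size: int = 52) -> float:
    """Share of the pot for narrowing the card down to r candidates."""
    return 1 - r / deck_size

print(payout_share(52))  # 0.0    -> no narrowing at all, no reward
print(payout_share(13))  # 0.75   -> narrowed to 13 candidates (e.g., a single suit)
print(payout_share(1))   # ~0.981 -> the exact card, almost the whole pot
```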

Bob: The suit is hearts.
Alice: The number is $7$.

Regardless of which card was actually drawn, if both of them are telling the truth, who deserves the larger share? Bob has narrowed the card down to one of the four suits, a probability of $1/4$, whereas Alice has pinned down one of the 13 ranks, a probability of $1/13$. With no prior knowledge, Alice’s statement is less likely to be true, which makes her guess the more valuable one information-wise.

Notice that Bob guessed a suit and Alice a rank. Since every suit in a deck of playing cards contains exactly the same 13 ranks, the suit and the rank of a drawn card are independent: knowing one tells you nothing about the other. So a combined guess like ‘7 of Hearts’, made without any hints from either side, should carry the value of both guesses with no loss.
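Concretely, the two probabilities simply multiply:

$$ P(\text{7 of Hearts}) = P(\text{Hearts}) \cdot P(7) = \frac{1}{4} \cdot \frac{1}{13} = \frac{1}{52} $$

so pinning down both the suit and the rank is exactly as hard as naming the card outright.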

From the analogy above, we have seen how a statement becomes more valuable the less likely it is to be true by chance. If we agree that the ‘amount of information’ can be defined through the probability of a statement being true, let us translate these ideas into formulas. Here, an ‘event’ plays the role of a ‘statement’.

Conditions That Information Must Satisfy

For a function $I$ to serve as a measure of the information of an event, it must satisfy the following conditions:

  • (i): For all events $E$, $$ I(E) \ge 0 $$
  • (ii): If for two events $E_{1} , E_{2}$, it holds that $P \left( E_{1} \right) \le P \left( E_{2} \right)$ then, $$ I \left( E_{1} \right) \ge I \left( E_{2} \right) $$
  • (iii): If two events $E_{1} , E_{2}$ are independent, $$ I \left( E_{1} \cap E_{2} \right) = I \left( E_{1} \right) + I \left( E_{2} \right) $$

For example, for constants $K > 0$ and $a > 1$, $$ I(E) := -K \log_{a} \left( P (E) \right) $$ satisfies all of the above conditions. Since the argument of the logarithm is a probability, it cannot exceed $1$, so the logarithm is at most $0$ and $I(E) \ge 0$, which is (i). And since $\log$ is an increasing function, $-K \log$ is decreasing, so (ii) follows immediately. The notable condition is (iii): when two events are independent, the logarithm handles it effortlessly, $$ \begin{align*} I \left( E_{1} \cap E_{2} \right) =& -K \log_{a} \left( P \left( E_{1} \cap E_{2} \right) \right) \\ =& -K \log_{a} \left( P \left( E_{1} \right) P \left( E_{2} \right) \right) \\ =& -K \log_{a} \left( P \left( E_{1} \right) \right) -K \log_{a} \left( P \left( E_{2} \right) \right) \\ =& I \left( E_{1} \right) + I \left( E_{2} \right) \end{align*} $$ but it is hard to find any other function that turns the probability of an intersection into a sum the way the third line does. In fact, the logarithm is essentially the only function that works, which is why Shannon information is defined with $K = 1$ and $a = 2$.
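To see these conditions at work on the card example, here is a minimal sanity check in plain Python (the helper `info` is just for illustration), using $K = 1$ and $a = 2$ as in the definition below:

```python
from math import log2

def info(p: float) -> float:
    """I(E) = -log2 P(E), i.e. K = 1 and a = 2."""
    return -log2(p)

i_suit = info(1 / 4)   # 2.0 bits     (Bob: "The suit is hearts.")
i_rank = info(1 / 13)  # ~3.700 bits  (Alice: "The number is 7.")
i_card = info(1 / 52)  # ~5.700 bits  (naming the exact card)

# (iii): independence turns a product of probabilities into a sum of information
assert abs((i_suit + i_rank) - i_card) < 1e-9
```

The suit is worth $2$ bits, the rank about $3.70$ bits, and together they add up to the roughly $5.70$ bits of naming the exact card.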

A Complicated Definition

Given a probability space $\left( \Omega , \mathcal{F}, P \right)$, the quantity $I$ defined below is called Shannon information or information content.

Information Content of an Event 1

The information content of event $E \in \mathcal{F}$ is defined as follows.

$$ I(E) := - \log_{2} P(E) $$

Information Content of a Random Variable

The information content of a random variable $X$ defined on the given probability space is itself another univariate random variable $I(X) : \mathcal{F} \to \mathbb{R}^{1}$, with the following probability distribution.

$$ I \left( X (E) \right) := - \log_{2} P(E) \qquad \text{ with probability } P(E) $$
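As a minimal sketch of this definition, take a discrete random variable and write out the distribution of $I(X)$ directly; the biased coin here is a made-up example, not something from the text:

```python
from math import log2

# Hypothetical biased coin: the rarer outcome should carry more information.
biased_coin_pmf = {"heads": 3 / 4, "tails": 1 / 4}

# I(X) is itself a random variable: it takes the value -log2 P(E)
# with the same probability P(E) as the underlying event.
info_distribution = {outcome: (-log2(p), p) for outcome, p in biased_coin_pmf.items()}

for outcome, (bits, prob) in info_distribution.items():
    print(f"{outcome}: I = {bits:.3f} bits, with probability {prob}")
# heads: I = 0.415 bits, with probability 0.75
# tails: I = 2.000 bits, with probability 0.25
```

Taking the expected value of this random variable is exactly what leads to Shannon entropy, mentioned below.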

Explanation

In information theory, unless stated otherwise, the base of the logarithm is understood to be $2$ rather than $e$, and the corresponding unit is the bit.

If you followed the Buildup, it should not be hard to see why a probability sits inside a logarithm in the definition of information content. The real question is why the concept is extended to random variables. Recall that a random variable is what carries real-world events into the mathematical realm; the extension is made precisely so that its distribution reflects information content.

It’s Okay if It’s Challenging

Don’t worry too much if the mention of a probability space in the definition doesn’t quite click yet. References do not usually define the information content of a random variable in such a complicated way, but considering the discussion that leads to Shannon entropy, this is actually already a simplified version.

Even in its ‘simple’ form, the definition still rests on a foundation of measure theory and probability theory. That said, unless you plan to dig deep into mathematics, statistics, or machine learning, there is no need to master the rigorous machinery; grasping the concepts is enough.



  1. Applebaum. (2008). Probability and Information (2nd Edition): p107.