Shannon Entropy in Classical Information Theory
Definition
Let's assume the discrete random variable $X$ can take values $x_{1}, x_{2}, \dots, x_{n}$. Let's denote the probability mass function of $X$ as $p$. Then, the entropy of $X$ (or of $p$) is defined as follows:
$$ \begin{equation} H(X) = H(p) := E\left[ I(x_{i}) \right] = \sum_{i=1}^{n} p(x_{i}) I(x_{i}) = -\sum_{i=1}^{n} p(x_{i}) \log_{2}p(x_{i}) \end{equation} $$
Here, $I$ represents the information content, and $E$ represents the expected value.
If $X$ is a continuous random variable with probability density function $p$,
$$ H(X) = H(p) = - \int_{-\infty}^{\infty} p(x)\log_{2}p(x) dx $$
Explanation
Simply put, entropy is the expected value (average) of information. Entropy allows us to mathematically handle the efficiency of coding and the limits of communication.
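As a concrete illustration, here is a minimal sketch (not part of the original article, assuming NumPy is available) that computes the entropy of a discrete distribution directly from the definition above:

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy, in bits, of a discrete distribution given as an array of probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # terms with p(x) = 0 contribute nothing, by convention
    return -np.sum(p * np.log2(p))

# A fair coin carries 1 bit of information per flip on average.
print(shannon_entropy([0.5, 0.5]))    # 1.0
# A heavily biased coin is far more predictable, so its entropy is lower.
print(shannon_entropy([0.95, 0.05]))  # ≈ 0.286
```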
Entropy is often described as disorder. Here, order refers to rules, trends, patterns, etc. Therefore, high entropy means high disorder, indicating that it is difficult to discern patterns or rules for the random variable $X$.
Let's consider a biased coin flip. If the probability of getting heads is $p$, then the probability of tails is $1-p$, and the entropy is as follows:
$$ H = -p\log_{2}p - (1-p)\log_{2}(1-p) $$
If we plot $H$ as a function of $p$, the curve looks like this:
When the probability of heads is $p = \dfrac{1}{2}$, the entropy reaches its maximum value $H = 1$. This means it's most difficult to discern any pattern or rule in the coin flip: we genuinely cannot tell which side of the coin is more likely to show up. If the probability of heads moves away from $\dfrac{1}{2}$, the entropy decreases. For example, if the probability of heads is $\dfrac{95}{100}$, the entropy is about $H = -\dfrac{95}{100}\log_{2}\dfrac{95}{100}-\dfrac{5}{100}\log_{2}\dfrac{5}{100} \approx 0.286$, indicating lower disorder, meaning there is some rule or pattern (in this case, heads come up most of the time). This can be summarized as follows:
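The original figure is not reproduced here; a minimal sketch that draws the binary entropy curve (an illustration only, assuming NumPy and matplotlib are installed):

```python
import numpy as np
import matplotlib.pyplot as plt

# Binary entropy H(p) = -p log2 p - (1-p) log2 (1-p), avoiding the endpoints 0 and 1.
p = np.linspace(0.001, 0.999, 500)
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)

plt.plot(p, H)
plt.xlabel("p (probability of heads)")
plt.ylabel("H (bits)")
plt.title("Binary entropy of a biased coin")
plt.show()
```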
High entropy = high disorder = no regularity or pattern = hard to predict the result
Low entropy = low disorder = presence of regularity or pattern = easier to predict the result
As you can guess from the above example, generally, when there are $n$ possible outcomes, the highest entropy occurs when all probabilities are equal to $\dfrac{1}{n}$.
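A quick numerical check of this claim, as a sketch (not from the original article, assuming NumPy): random distributions over $n = 4$ outcomes never exceed the uniform value $\log_{2} 4 = 2$ bits.

```python
import numpy as np

def shannon_entropy(p):
    # Same helper as in the earlier sketch: entropy in bits.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
n = 4

# Entropy of the uniform distribution over n outcomes: log2(n) = 2 bits.
uniform_entropy = shannon_entropy(np.full(n, 1 / n))

# Random distributions over n outcomes (Dirichlet samples) never exceed it.
samples = rng.dirichlet(np.ones(n), size=10_000)
entropies = np.array([shannon_entropy(q) for q in samples])

print(uniform_entropy)                           # 2.0
print(bool(entropies.max() <= uniform_entropy))  # True: uniform is the maximum
```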
Properties
Let's assume the random variable $X$ can take values $x_{1}, x_{2}, \dots, x_{n}$. Entropy $H$ has the following properties:
- $H$ is a concave function.
- For some $x_{i}$, if $p(x_{i}) = 1$, then $H(X) = 0$.
- When all probabilities are equal, $p(x_{i}) = \dfrac{1}{n}$, entropy is maximum, and its value is $\log_{2}n$.
- For a random vector $X \in \mathbb{R}^{n}$ with a mean of $\mathbf{0}$ and a covariance matrix of $K$, the following holds for its entropy:
$$
\begin{equation}
H(X) \le \dfrac{1}{2}\ln \left[ (2 \pi e)^{n} \left| K \right| \right]
\end{equation}
$$
$\left| K \right|$ is the determinant of the covariance matrix. If $X$ is normally distributed, equality holds.
- Given the mean $\mu$ and variance $\sigma^{2}$, the distribution with the maximum entropy is the normal distribution.
- For the random variable $X$ and its estimator $\hat{X}$, the following holds (a numerical check is sketched below): $$ E\left[ (X - \hat{X})^{2} \right] \ge \dfrac{1}{2\pi e} e^{2H(X)} $$
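This bound is tight for a normal distribution: with $X \sim N(\mu, \sigma^{2})$ the right-hand side equals $\sigma^{2}$, which is exactly the mean squared error of the best constant estimator $\hat{X} = \mu$. A minimal Monte Carlo sketch (not from the original article, assuming NumPy; entropy here is in nats, matching the $\ln$-based formula):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.0, 2.0

# Differential entropy of N(mu, sigma^2) in nats: H = 0.5 * ln(2*pi*e*sigma^2).
H = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
lower_bound = np.exp(2 * H) / (2 * np.pi * np.e)   # simplifies to sigma^2 = 4

# Mean squared error of the estimator X_hat = mu (the best constant guess).
x = rng.normal(mu, sigma, size=1_000_000)
mse = np.mean((x - mu) ** 2)

print(lower_bound)   # 4.0
print(mse)           # ≈ 4.0, so the inequality holds with (near) equality
```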
Proof
For convenience, let's write the random vector as $\mathbf{x} = X$. Let $g$ be any probability density function that satisfies $\displaystyle \int g(\mathbf{x})x_{i}x_{j} d \mathbf{x} = K_{ij}$. Let $\phi$ denote the probability density function of the normal distribution $N(\mathbf{0}, K)$: $$ \phi (\mathbf{x}) = \dfrac{1}{\sqrt{(2\pi)^{n} \left| K \right|}} \exp \left( -\dfrac{1}{2}\mathbf{x}^{T} K^{-1} \mathbf{x} \right) $$ First, we'll show that $\displaystyle \int g(\mathbf{x}) \ln \phi (\mathbf{x}) d \mathbf{x} = \int \phi (\mathbf{x}) \ln \phi (\mathbf{x}) d \mathbf{x}$ holds. Calculating $\ln \phi (\mathbf{x})$ first,
$$ \begin{align*} \ln \phi (\mathbf{x}) &= \ln\dfrac{1}{\sqrt{(2\pi)^{n} \left| K \right|}} - \dfrac{1}{2}\mathbf{x}^{T} K^{-1} \mathbf{x} \\ &= C + \sum a_{ij} x_{i} x_{j} \end{align*} $$
The first term can be expressed as some constant $C$, and the second term is a quadratic form $\sum a_{ij} x_{i} x_{j}$ whose coefficients depend only on $K^{-1}$. Therefore,
$$ \begin{align*} \int g(\mathbf{x}) \ln \phi (\mathbf{x}) d \mathbf{x} &= C \int g(\mathbf{x}) d \mathbf{x} + \int g(\mathbf{x})\sum a_{ij}x_{i}x_{j} d \mathbf{x} \\ &= C + \sum a_{ij} \int g(\mathbf{x}) x_{i}x_{j} d \mathbf{x} \\ &= C + \sum a_{ij}K_{ij} \qquad \text{by assumption for $g$} \end{align*} $$
Also,
$$ \begin{align*} \int \phi (\mathbf{x}) \ln \phi (\mathbf{x}) d \mathbf{x} &= C \int \phi (\mathbf{x}) d \mathbf{x} + \int \phi (\mathbf{x})\sum a_{ij}x_{i}x_{j} d \mathbf{x} \\ &= C + \sum a_{ij} \int \phi (\mathbf{x}) x_{i}x_{j} d \mathbf{x} \\ &= C + \sum a_{ij}K_{ij} \qquad \text{by definition of covariance} \end{align*} $$
Since the relative entropy is always greater than or equal to $0$,
$$ \begin{align*} 0 &\le D(g \| \phi) \\ &= \int g \ln \dfrac{g}{\phi} \\ &= \int g \ln g - \int g \ln \phi \\ &= - H(g) - \int \phi \ln \phi \\ &= - H(g) + H(\phi) \end{align*} $$
The entropy of the normal distribution is $\dfrac{1}{2}\ln \left[ (2 \pi e)^{n} \left| K \right| \right]$, so,
$$ H(X) = H(g) \le H(\phi) = \dfrac{1}{2}\ln \left[ (2 \pi e)^{n} \left| K \right| \right] $$
Now, let's assume $X$ is a one-dimensional random variable. For any estimator $\hat{X}$,
$$ \begin{align*} E\left[ (X - \hat{X})^{2} \right] &\ge \min_{\hat{X}} E\left[ (X - \hat{X})^{2} \right] \\ &= E\left[ (X - E(X))^{2} \right] \\ &= \Var(X) \end{align*} $$
For a one-dimensional $X$, the entropy bound above gives the following:
$$ \begin{align*} && H(X) &\le \dfrac{1}{2} \ln(2\pi e \sigma^{2}) \\ \implies && 2H(X) &\le \ln(2\pi e \sigma^{2}) \\ \implies && e^{2H(X)} &\le 2\pi e \sigma^{2} \\ \implies && \dfrac{1}{2\pi e}e^{2H(X)} &\le \sigma^{2} = \Var(X) \\ \end{align*} $$
Substituting into the above equation,
$$ E\left[ (X - \hat{X})^{2} \right] \ge \dfrac{1}{2\pi e} e^{2H(X)} $$
Entropy of the Normal Distribution
The entropy of the normal distribution $N(\mu, \sigma^{2})$ (when the natural log is used) is as follows:
$$ H = \dfrac{1}{2} \ln (2\pi e \sigma^{2}) = \ln \sqrt{2\pi e \sigma^{2}} $$
The entropy of the multivariate normal distribution $N(\boldsymbol{\mu}, K)$ is as follows:
$$ H = \dfrac{1}{2}\ln \left[ (2 \pi e)^{n} \left| K \right| \right] = \dfrac{1}{2}\ln (\det (2\pi e K)) $$
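As a sanity check, these closed forms can be compared against SciPy's built-in differential entropy for the normal and multivariate normal distributions (a sketch, not part of the original article, assuming SciPy is installed; values are in nats):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Univariate: H = 0.5 * ln(2*pi*e*sigma^2)
sigma = 1.5
print(0.5 * np.log(2 * np.pi * np.e * sigma**2))
print(norm(loc=0.0, scale=sigma).entropy())

# Multivariate: H = 0.5 * ln(det(2*pi*e*K))
K = np.array([[2.0, 0.3],
              [0.3, 1.0]])
print(0.5 * np.log(np.linalg.det(2 * np.pi * np.e * K)))
print(multivariate_normal(mean=np.zeros(2), cov=K).entropy())
```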