Shannon Entropy in Classical Information Theory


Definition [1] [2]

Let’s assume the discrete random variable $X$ can take the values $x_{1}, x_{2}, \dots, x_{n}$. Let’s denote the probability mass function of $X$ as $p$. Then, the entropy $H$ of $X$ is defined as follows:

$$ H(X) = H(p) := E\left[ I(x_{i}) \right] = \sum_{i=1}^{n} p(x_{i}) I(x_{i}) = -\sum_{i=1}^{n} p(x_{i}) \log_{2}p(x_{i}) $$

Here, $I$ denotes the information content, and $E$ denotes the expected value.

If $X$ is a continuous random variable,

$$ H(X) = H(p) = - \int_{-\infty}^{\infty} p(x)\log_{2}p(x) \, dx $$

Explanation

Simply put, entropy is the expected value (average) of information. Entropy allows us to mathematically handle the efficiency of coding and the limits of communication.
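To see the discrete definition in action, here is a minimal Python sketch (the helper name `shannon_entropy` is illustrative, not taken from the cited references) that evaluates the sum above for a given probability mass function:

```python
import math

def shannon_entropy(pmf, base=2):
    """H(p) = -sum p(x) log p(x); outcomes with p(x) = 0 contribute nothing."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# A fair six-sided die: uniform pmf over 6 outcomes gives H = log2(6) ≈ 2.585 bits.
print(shannon_entropy([1/6] * 6))

# A heavily skewed pmf is much easier to predict, so its entropy is far lower (≈ 0.242 bits).
print(shannon_entropy([0.97, 0.01, 0.01, 0.01]))
```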

Entropy is often described as disorder. Here, order refers to rules, trends, patterns, and so on. Therefore, high entropy means high disorder, indicating that it is difficult to discern patterns or rules in the random variable $X$.

Let’s consider a biased coin flip. If the probability of getting heads is $p$, then the probability of tails is $1-p$, and the entropy is as follows:

$$ H = -p\log_{2}p - (1-p)\log_{2}(1-p) $$

If we plot the entropy $H$ as a function of $p$, it looks like this:

[Figure entropy.png: the binary entropy $H$ plotted against $p$]

When the probability of heads is $\dfrac{1}{2}$, the entropy is $H = -\dfrac{1}{2}\log_{2}\dfrac{1}{2}-\dfrac{1}{2}\log_{2}\dfrac{1}{2} = 1$ and is at its maximum. This means it’s most challenging to discern any pattern or rule in the coin flip; in fact, we have no idea which side of the coin will show up. If the probability of heads moves away from $\dfrac{1}{2}$, the entropy decreases. For example, if the probability of heads is $\dfrac{95}{100}$, the entropy is about $H = -\dfrac{95}{100}\log_{2}\dfrac{95}{100} - \dfrac{5}{100}\log_{2}\dfrac{5}{100} \approx 0.286$, indicating lower disorder, meaning there is some rule or pattern (in this case, heads comes up most of the time). This can be summarized as follows:

High entropy = high disorder = no regularity or pattern = hard to predict the result
Low entropy = low disorder = presence of regularity or pattern = easier to predict the result

As you can guess from the above example, generally, when there are $n$ possible outcomes, the highest entropy occurs when all probabilities are equal to $\dfrac{1}{n}$.
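The coin-flip numbers above are easy to verify; the following sketch (with an illustrative helper `binary_entropy`) reproduces them and shows that the fair coin sits at the maximum:

```python
import math

def binary_entropy(p):
    """Entropy of a biased coin: H(p) = -p log2 p - (1-p) log2 (1-p)."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0     -> maximal: the fair coin is hardest to predict
print(binary_entropy(0.95))  # ≈ 0.286 -> a strongly biased coin is much more predictable
print(binary_entropy(0.99))  # ≈ 0.081
```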

Properties

Let’s assume the random variable $X$ can take the values $x_{1}, x_{2}, \dots, x_{n}$. Entropy $H$ has the following properties:

  1. $H$ is a concave function.
  2. For some $x_{i}$, if $p(x_{i}) = 1$, then $H(X) = 0$.
  3. When all probabilities are equal, that is, $p(x_{i}) = \dfrac{1}{n}$, the entropy is maximum, and its value is $\log_{2}n$.
  4. For a random vector $X \in \mathbb{R}^{n}$ with a mean of $\mathbf{0}$ and a covariance matrix of $K$, the following holds for its entropy: $$ H(X) \le \dfrac{1}{2}\ln \left[ (2 \pi e)^{n} \left| K \right| \right] $$ Here $\left| K \right|$ is the determinant of the covariance matrix. If $X$ is normally distributed, equality holds.
    • Given the mean $\mu$ and variance $\sigma^{2}$, the distribution with the maximum entropy is the normal distribution (a numerical check of this appears right after this list).
    • For the random variable $X$ and an estimator $\hat{X}$, the following holds: $$ E\left[ (X - \hat{X})^{2} \right] \ge \dfrac{1}{2\pi e} e^{2H(X)} $$
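As a one-dimensional sanity check of property 4, the sketch below compares the closed-form differential entropies (in nats; these formulas are standard results, assumed here rather than taken from the cited references) of three distributions that share the same variance; the normal comes out largest:

```python
import math

sigma = 2.0  # common standard deviation; all three distributions below have variance sigma**2

h_normal  = 0.5 * math.log(2 * math.pi * math.e * sigma**2)  # N(0, sigma^2)
h_uniform = math.log(sigma * math.sqrt(12))                  # uniform of width sigma*sqrt(12)
h_laplace = 1 + math.log(math.sqrt(2) * sigma)               # Laplace with scale sigma/sqrt(2)

# The normal distribution attains the largest differential entropy for a fixed variance.
print(h_normal, h_uniform, h_laplace)  # ≈ 2.112, 1.936, 2.040
```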

Proof

Property 4

For convenience, let’s assume the mean is $\mathbf{0}$. Let $g$ be any probability density function that satisfies $\displaystyle \int g(\mathbf{x})x_{i}x_{j} d \mathbf{x} = K_{ij}$, and let $\phi$ denote the probability density function of the normal distribution $N(\mathbf{0}, K)$:

$$ \phi (\mathbf{x}) = \dfrac{1}{\sqrt{(2\pi)^{n} \left| K \right|}} \exp \left( -\dfrac{1}{2}\mathbf{x}^{T} K^{-1} \mathbf{x} \right) $$

First, we’ll show that $\displaystyle \int g(\mathbf{x}) \ln \phi (\mathbf{x}) d \mathbf{x} = \int \phi (\mathbf{x}) \ln \phi (\mathbf{x}) d \mathbf{x}$ holds. Calculating $\ln \phi$ first,

$$
\begin{align*}
\ln \phi (\mathbf{x}) &= \ln\dfrac{1}{\sqrt{(2\pi)^{n} \left| K \right|}} - \dfrac{1}{2}\mathbf{x}^{T} K^{-1} \mathbf{x} \\
&= C + \sum a_{ij} x_{i} x_{j}
\end{align*}
$$

The first term can be expressed as some constant $C$, and the second term can be expressed as a quadratic form that depends only on the products $x_{i}x_{j}$, with coefficients $a_{ij}$. Therefore,

$$
\begin{align*}
\int g(\mathbf{x}) \ln \phi (\mathbf{x}) d \mathbf{x} &= C \int g(\mathbf{x}) d \mathbf{x} + \int g(\mathbf{x})\sum a_{ij}x_{i}x_{j} d \mathbf{x} \\
&= C + \sum a_{ij} \int g(\mathbf{x}) x_{i}x_{j} d \mathbf{x} \\
&= C + \sum a_{ij}K_{ij} \qquad \text{by assumption for } g
\end{align*}
$$

Also,

$$
\begin{align*}
\int \phi (\mathbf{x}) \ln \phi (\mathbf{x}) d \mathbf{x} &= C \int \phi (\mathbf{x}) d \mathbf{x} + \int \phi (\mathbf{x})\sum a_{ij}x_{i}x_{j} d \mathbf{x} \\
&= C + \sum a_{ij} \int \phi (\mathbf{x}) x_{i}x_{j} d \mathbf{x} \\
&= C + \sum a_{ij}K_{ij} \qquad \text{by definition of covariance}
\end{align*}
$$

The two results are equal, so $\displaystyle \int g(\mathbf{x}) \ln \phi (\mathbf{x}) d \mathbf{x} = \int \phi (\mathbf{x}) \ln \phi (\mathbf{x}) d \mathbf{x}$ indeed holds. Since the relative entropy is always greater than or equal to $0$,

$$
\begin{align*}
0 &\le D(g \| \phi) \\
&= \int g \ln \dfrac{g}{\phi} \\
&= \int g \ln g - \int g \ln \phi \\
&= - H(g) - \int \phi \ln \phi \\
&= - H(g) + H(\phi)
\end{align*}
$$

The entropy of the normal distribution is $H(\phi) = \dfrac{1}{2}\ln \left[ (2 \pi e)^{n} \left| K \right| \right]$, so,

$$ H(X) = H(g) \le H(\phi) = \dfrac{1}{2}\ln \left[ (2 \pi e)^{n} \left| K \right| \right] $$

Now, let’s assume $X$ is a one-dimensional random variable.

$$
\begin{align*}
E\left[ (X - \hat{X})^{2} \right] &\ge \min_{\hat{X}} E\left[ (X - \hat{X})^{2} \right] \\
&= E\left[ (X - E(X))^{2} \right] \\
&= \operatorname{Var}(X)
\end{align*}
$$

For a one-dimensional $X$, the covariance matrix reduces to $K = \sigma^{2} = \operatorname{Var}(X)$, so we get the following:

$$
\begin{align*}
&& H(X) &\le \dfrac{1}{2} \ln(2\pi e \sigma^{2}) \\
\implies && 2H(X) &\le \ln(2\pi e \sigma^{2}) \\
\implies && e^{2H(X)} &\le 2\pi e \sigma^{2} \\
\implies && \dfrac{1}{2\pi e}e^{2H(X)} &\le \sigma^{2} = \operatorname{Var}(X)
\end{align*}
$$

Substituting into the above equation,

$$ E\left[ (X - \hat{X})^{2} \right] \ge \dfrac{1}{2\pi e} e^{2H(X)} $$
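For a normal $X$ the bound is attained, which can be checked with a small Monte Carlo sketch (using the constant estimator $\hat{X} =$ sample mean, an assumption made purely for illustration):

```python
import math
import random

sigma = 1.5
samples = [random.gauss(0.0, sigma) for _ in range(200_000)]

x_hat = sum(samples) / len(samples)                        # constant estimator ≈ E[X]
mse = sum((x - x_hat) ** 2 for x in samples) / len(samples)

H = 0.5 * math.log(2 * math.pi * math.e * sigma**2)        # differential entropy in nats
bound = math.exp(2 * H) / (2 * math.pi * math.e)           # simplifies to exactly sigma^2 = 2.25

print(mse, bound)  # the empirical MSE hovers around the bound, as expected in the normal case
```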

Entropy of the Normal Distribution

The entropy of the normal distribution $N(\mu, \sigma^{2})$ (when the natural log is used) is as follows:

$$ H = \dfrac{1}{2} \ln (2\pi e \sigma^{2}) = \ln \sqrt{2\pi e \sigma^{2}} $$

The entropy of the multivariate normal distribution $N_{n}(\boldsymbol{\mu}, K)$ is as follows:

$$ H = \dfrac{1}{2}\ln \left[ (2 \pi e)^{n} \left| K \right| \right] = \dfrac{1}{2}\ln (\det (2\pi e K)) $$
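As a quick consistency check (a sketch, not part of the derivation above), the closed-form entropy of a one-dimensional normal can be compared with a direct Riemann-sum evaluation of $-\int p(x) \ln p(x)\, dx$:

```python
import math

sigma = 0.7

def pdf(x):
    # density of N(0, sigma^2)
    return math.exp(-x**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

dx = 1e-4
xs = [i * dx for i in range(int(-10 * sigma / dx), int(10 * sigma / dx))]
numeric = -sum(pdf(x) * math.log(pdf(x)) * dx for x in xs)

closed_form = 0.5 * math.log(2 * math.pi * math.e * sigma**2)
print(numeric, closed_form)  # both ≈ 1.062 nats for sigma = 0.7
```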

  1. Kim Young-hoon · Heo Jae-seong, Quantum Information Theory (2020), p. 246

  2. Stephen M. Barnett, Quantum Information (2009), pp. 7-10