Simply put, entropy is the expected value (average) of information. Entropy allows us to mathematically handle the efficiency of coding and the limits of communication.
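For reference, here is that statement as a formula. For a discrete random variable $X$ taking values $x_1, \dots, x_n$ with probabilities $p(x_i)$ (notation added here for illustration), the information of an outcome is $-\log_2 p(x_i)$, and the entropy is its expected value:

$$
H(X) = E\left[ -\log_2 p(X) \right] = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)
$$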
Entropy is often described as disorder. Here, order refers to rules, trends, patterns, and so on. Therefore, high entropy means high disorder, indicating that it is difficult to discern any pattern or rule in the random variable $X$.
Let’s consider a biased coin flip. If the probability of getting heads is $p$, then the probability of tails is $1-p$, and the entropy is as follows:
$$
H = -p \log_2 p - (1-p) \log_2 (1-p)
$$
If we plot the entropy $H$ as a function of $p$, we get a curve that is symmetric about $p = 1/2$, peaks there, and falls to $0$ at $p = 0$ and $p = 1$.
When the probability of heads is $1/2$, the entropy is $1$ and is at its maximum. This means it is most difficult to discern any pattern or rule in the coin flip; in fact, we cannot say anything about which side of the coin is more likely to show up. As the probability of heads moves away from $1/2$, the entropy decreases. For example, if the probability of heads is, say, $9/10$, the entropy is about $H = -\frac{9}{10}\log_2\frac{9}{10} - \frac{1}{10}\log_2\frac{1}{10} \approx 0.47$, indicating lower disorder, meaning there is some rule or pattern (in this case, heads come up most of the time). This can be summarized as follows:
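To make the numbers above concrete, here is a minimal Python sketch (not from the original text; the function name binary_entropy is ours) that evaluates the binary entropy at a few values of $p$:

```python
import numpy as np

def binary_entropy(p: float) -> float:
    """Entropy, in bits, of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):       # convention: 0 * log2(0) = 0
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

for p in (0.5, 0.9, 0.99):
    print(f"p = {p:4.2f}  ->  H = {binary_entropy(p):.3f} bits")
# p = 0.50  ->  H = 1.000 bits   (maximum: hardest to predict)
# p = 0.90  ->  H = 0.469 bits
# p = 0.99  ->  H = 0.081 bits   (almost always heads: easy to predict)
```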
High entropy = high disorder = no regularity or pattern = hard to predict the result
Low entropy = low disorder = presence of regularity or pattern = easier to predict the result
As you can guess from the above example, in general, when there are $n$ possible outcomes, the highest entropy occurs when all probabilities are equal to $1/n$.
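As a quick check of this claim (an added calculation, not part of the original text), plugging the uniform probabilities $p(x_i) = 1/n$ into the definition gives

$$
H = -\sum_{i=1}^{n} \frac{1}{n}\log_2\frac{1}{n} = n \cdot \frac{1}{n}\log_2 n = \log_2 n,
$$

so, for example, a fair six-sided die ($n = 6$) has entropy $\log_2 6 \approx 2.58$ bits.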
Properties
Let’s assume the random variable $X$ can take the $n$ values $x_1, \dots, x_n$. The entropy $H(X)$ has the following properties:

$H(X) = 0$ when one of the outcomes is certain, that is, when $p(x_i) = 1$ for some $i$.

When all probabilities are equal to $p(x_i) = \frac{1}{n}$, the entropy is maximal, and its value is $\log_2 n$.

For a random vector $\mathbf{X} \in \mathbb{R}^{p}$ with mean $\mathbf{0}$ and covariance matrix $K$, the following holds for its entropy, where $|K|$ is the determinant of the covariance matrix; if $\mathbf{X}$ is normally distributed, equality holds:

$$
H(\mathbf{X}) \le \frac{1}{2}\ln\left[ (2\pi e)^{p} \left| K \right| \right]
$$

Given the mean $\mu$ and variance $\sigma^{2}$, the distribution with the maximum entropy is the normal distribution.

For the random variable $X$ and an estimator $\hat{X}$, the following holds:

$$
E\left[ (X - \hat{X})^{2} \right] \ge \frac{1}{2\pi e} e^{2H(X)}
$$
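As a quick sanity check of the last bound (an added example, not part of the original text), take $X \sim N(\mu, \sigma^{2})$ and the estimator $\hat{X} = \mu$. Since $H(X) = \frac{1}{2}\ln(2\pi e \sigma^{2})$ in nats,

$$
\frac{1}{2\pi e} e^{2H(X)} = \frac{1}{2\pi e}\cdot 2\pi e \sigma^{2} = \sigma^{2} = E\left[ (X-\mu)^{2} \right],
$$

so the inequality holds with equality for the normal distribution.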
Proof
For convenience, let’s assume the mean is $\mathbf{0}$. Let $g$ be any probability density function that satisfies $\int g(\mathbf{x}) x_{i} x_{j} \, d\mathbf{x} = K_{ij}$, and let $\phi$ denote the probability density function of the normal distribution $N(\mathbf{0}, K)$:
$$
\phi(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^{p}\left| K \right|}} \exp\left( -\frac{1}{2}\mathbf{x}^{T} K^{-1} \mathbf{x} \right)
$$
First, we’ll show that $\int g(\mathbf{x}) \ln \phi(\mathbf{x}) \, d\mathbf{x} = \int \phi(\mathbf{x}) \ln \phi(\mathbf{x}) \, d\mathbf{x}$ holds. Computing $\ln \phi$ first,
$$
\ln \phi(\mathbf{x}) = \ln \frac{1}{\sqrt{(2\pi)^{p}\left| K \right|}} - \frac{1}{2}\mathbf{x}^{T} K^{-1} \mathbf{x} = C + \sum a_{ij} x_{i} x_{j}
$$
The first term is just some constant $C$, and the second term is a quadratic form whose coefficients $a_{ij}$ depend only on $K$. Therefore,
$$
\begin{align*}
\int g (\mathbf{x}) \ln \phi (\mathbf{x}) d \mathbf{x}
&= C \int g (\mathbf{x}) d \mathbf{x} + \int g (\mathbf{x})\sum a_{ij}x_{i}x_{j} d \mathbf{x} \\
&= C + \sum a_{ij} \int g (\mathbf{x}) x_{i}x_{j} d \mathbf{x} \\
&= C + \sum a_{ij}K_{ij} \qquad \text{by assumption on } g
\end{align*}
$$
Also,
$$
\begin{align*}
\int \phi (\mathbf{x}) \ln \phi (\mathbf{x}) d \mathbf{x}
&= C \int \phi (\mathbf{x}) d \mathbf{x} + \int \phi (\mathbf{x})\sum a_{ij}x_{i}x_{j} d \mathbf{x} \\
&= C + \sum a_{ij} \int \phi (\mathbf{x}) x_{i}x_{j} d \mathbf{x} \\
&= C + \sum a_{ij}K_{ij} \qquad \text{by definition of covariance}
\end{align*}
$$

Now, since the relative entropy (Kullback-Leibler divergence) is non-negative,

$$
0 \le \int g(\mathbf{x}) \ln \frac{g(\mathbf{x})}{\phi(\mathbf{x})} d\mathbf{x}
= -H(g) - \int g(\mathbf{x}) \ln \phi(\mathbf{x}) d\mathbf{x}
= -H(g) - \int \phi(\mathbf{x}) \ln \phi(\mathbf{x}) d\mathbf{x}
= -H(g) + H(\phi),
$$

so $H(g) \le H(\phi) = \frac{1}{2}\ln\left[ (2\pi e)^{p}\left| K \right| \right]$, which proves the claim.
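As a quick numerical illustration of this result (an added example, not part of the original proof), compare two distributions with the same variance $\sigma^{2} = 1$: the standard normal and the uniform distribution on $[-\sqrt{3}, \sqrt{3}]$ (which also has variance $1$). In nats,

$$
H_{\text{normal}} = \frac{1}{2}\ln(2\pi e) \approx 1.419,
\qquad
H_{\text{uniform}} = \ln\left( 2\sqrt{3} \right) \approx 1.242,
$$

so the normal distribution indeed has the larger entropy.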
In particular, the entropy of the (univariate) normal distribution with variance $\sigma^{2}$ is as follows:

$$
H = \dfrac{1}{2} \ln (2\pi e \sigma^{2}) = \ln \sqrt{2\pi e \sigma^{2}}
$$
The entropy of the multivariate normal distribution is as follows:
$$
H = \dfrac{1}{2}\ln \left[ (2 \pi e)^{n} \left| K \right| \right] = \dfrac{1}{2}\ln (\det (2\pi e K))
$$
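The two formulas above can be checked numerically. Below is a minimal Python sketch using SciPy (the particular values of sigma and K are arbitrary choices for illustration); note that scipy.stats returns entropy in nats, matching the natural logarithm used here:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Univariate normal: H = (1/2) ln(2*pi*e*sigma^2)
sigma = 2.0
closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(np.isclose(norm(loc=0.0, scale=sigma).entropy(), closed_form))  # True

# Multivariate normal: H = (1/2) ln(det(2*pi*e*K))
K = np.array([[2.0, 0.3],
              [0.3, 1.0]])  # any positive-definite covariance matrix
closed_form = 0.5 * np.log(np.linalg.det(2 * np.pi * np.e * K))
print(np.isclose(multivariate_normal(mean=np.zeros(2), cov=K).entropy(), closed_form))  # True
```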