Probability Distributions under Conditional Probability in Mathematical Statistics
Definition
For a discrete random vector $(X, Y)$, let $p_{X,Y}$ be the joint probability mass function of $(X, Y)$, and let $p_{X}$ be the marginal probability mass function of $X$. Then the following $p_{Y | X}$ is called the conditional probability mass function of $Y$ given $X = x$.
$$ p_{Y | X} (y | x) = \frac{p_{X, Y} (x, y)}{p_{X} (x)} $$
For a continuous random vector $(X, Y)$, let $f_{X,Y}$ be the joint probability density function of $(X, Y)$, and let $f_{X}$ be the marginal probability density function of $X$. Then the following $f_{Y | X}$ is called the conditional probability density function of $Y$ given $X = x$.
$$ f_{Y | X} (y | x) = \frac{f_{X, Y} (x, y)}{f_{X} (x)} $$
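As a quick sanity check of the discrete definition, here is a minimal sketch in Python; the joint table is made up purely for illustration. Dividing each row of the joint pmf by the corresponding marginal yields a valid conditional pmf for each value of $X$.

```python
import numpy as np

# Hypothetical joint pmf of (X, Y) with X in {0, 1} and Y in {0, 1, 2};
# rows index x, columns index y.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.15, 0.25, 0.20]])

p_x = p_xy.sum(axis=1)              # marginal pmf of X
p_y_given_x = p_xy / p_x[:, None]   # p_{Y|X}(y|x) = p_{X,Y}(x,y) / p_X(x)

print(p_y_given_x.sum(axis=1))      # [1. 1.] -- each row is a valid pmf
```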
Generalization
For a discrete random vector $(X_{1}, \dots, X_{n})$, let $p_{1, \dots, n}$ be the joint probability mass function of $(X_{1}, \dots, X_{n})$, and let $p_{1, \dots, n-1}$ be the joint probability mass function of $X_{1}, \dots, X_{n-1}$. Then the following $p_{n | 1, \dots, n-1}$ is called the conditional probability mass function of $X_{n}$ given $X_{1} = x_{1}, \dots, X_{n-1} = x_{n-1}$.
$$ p_{n | 1, \dots, n-1} (x_{n} | x_{1}, \dots, x_{n-1}) = \frac{p_{1, \dots, n} (x_{1}, \dots, x_{n})}{p_{1, \dots, n-1} (x_{1}, \dots, x_{n-1})} $$
For a continuous random vector $(X_{1}, \dots, X_{n})$, let $f_{1, \dots, n}$ be the joint probability density function of $(X_{1}, \dots, X_{n})$, and let $f_{1, \dots, n-1}$ be the joint probability density function of $X_{1}, \dots, X_{n-1}$. Then the following $f_{n | 1, \dots, n-1}$ is called the conditional probability density function of $X_{n}$ given $X_{1} = x_{1}, \dots, X_{n-1} = x_{n-1}$.
$$ f_{n | 1, \dots, n-1} (x_{n} | x_{1}, \dots, x_{n-1}) = \frac{f_{1, \dots, n} (x_{1}, \dots, x_{n})}{f_{1, \dots, n-1} (x_{1}, \dots, x_{n-1})} $$
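A minimal sketch of the generalized discrete case, assuming a made-up $2 \times 2 \times 2$ joint table (array axes index $x_1, x_2, x_3$):

```python
import numpy as np

# Hypothetical joint pmf of (X1, X2, X3), stored as a 2x2x2 array.
p123 = np.random.default_rng(0).dirichlet(np.ones(8)).reshape(2, 2, 2)

p12 = p123.sum(axis=2)                # marginal pmf of (X1, X2)
p3_given_12 = p123 / p12[:, :, None]  # p_{3|1,2}(x3 | x1, x2)

print(p3_given_12.sum(axis=2))        # all ones: a valid pmf for each (x1, x2)
```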
For a discrete random vector $(X_{1}, \dots, X_{n})$, the following $p_{2, \dots, n | 1}$ is called the joint conditional probability mass function of $X_{2}, \dots, X_{n}$ given $X_{1} = x_{1}$.
$$ p_{2, \dots, n | 1} (x_{2}, \dots, x_{n} | x_{1}) = \frac{p_{1, \dots, n} (x_{1}, x_{2}, \dots, x_{n})}{p_{1} (x_{1})} $$
For a continuous random vector $(X_{1}, \dots, X_{n})$, the following $f_{2, \dots, n | 1}$ is called the joint conditional probability density function of $X_{2}, \dots, X_{n}$ given $X_{1} = x_{1}$.
$$ f_{2, \dots, n | 1} (x_{2}, \dots, x_{n} | x_{1}) = \frac{f_{1, \dots, n} (x_{1}, x_{2}, \dots, x_{n})}{f_{1} (x_{1})} $$
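The joint conditional pmf works the same way, dividing by the marginal of $X_{1}$ alone. A sketch on the same hypothetical table as above:

```python
import numpy as np

# Same hypothetical 2x2x2 joint pmf of (X1, X2, X3) as in the previous sketch.
p123 = np.random.default_rng(0).dirichlet(np.ones(8)).reshape(2, 2, 2)

p1 = p123.sum(axis=(1, 2))              # marginal pmf of X1
p23_given_1 = p123 / p1[:, None, None]  # p_{2,3|1}(x2, x3 | x1)

print(p23_given_1.sum(axis=(1, 2)))     # [1. 1.] -- a valid pmf for each x1
```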
Expected Value
- Given a function $u$ of $X_{2}, \dots, X_{n}$, the following is called the conditional expected value of $u (X_{2}, \dots, X_{n})$ given $X_{1} = x_{1}$.
$$ E \left[ u (X_{2}, \dots, X_{n}) \mid x_{1} \right] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} u (x_{2}, \dots, x_{n}) f_{2, \dots, n | 1} (x_{2}, \dots, x_{n} | x_{1}) \, dx_{2} \cdots dx_{n} $$
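For intuition, here is the discrete analogue of this formula (sums in place of integrals), with the toy choice $u(x_2, x_3) = x_2 + 2 x_3$ on the same hypothetical table:

```python
import numpy as np

# Discrete analogue of the conditional expectation on the hypothetical
# 2x2x2 joint pmf from the earlier sketches.
p123 = np.random.default_rng(0).dirichlet(np.ones(8)).reshape(2, 2, 2)
p23_given_1 = p123 / p123.sum(axis=(1, 2))[:, None, None]

x2, x3 = np.meshgrid([0, 1], [0, 1], indexing="ij")
u = x2 + 2 * x3                               # toy u(x2, x3)

# E[u(X2, X3) | X1 = x1] = sum over (x2, x3) of u * p_{2,3|1}
cond_E = (u * p23_given_1).sum(axis=(1, 2))
print(cond_E)                                 # one expected value per x1
```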
Theorem
[1] For convenience, omit the subscripts and write $p (x_{n} | x_{1}, \dots, x_{n-1}) = p_{n | 1, \dots, n-1} (x_{n} | x_{1}, \dots, x_{n-1})$ and $p (x_{1}, \dots, x_{n}) = p_{1, \dots, n} (x_{1}, \dots, x_{n})$. The following holds:
$$ p (x_{1}, \dots, x_{n}) = p (x_{1}) p (x_{2} | x_{1}) p (x_{3} | x_{2}, x_{1}) \cdots p (x_{n} | x_{1}, \dots, x_{n-1}) $$
or
$$ p (x_{1}, \dots, x_{n}) = p (x_{n}) p (x_{n-1} | x_{n}) p (x_{n-2} | x_{n-1}, x_{n}) \cdots p (x_{1} | x_{2}, \dots, x_{n}) $$
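A quick numerical check of [1] for $n = 3$, again on a made-up joint table:

```python
import numpy as np

# Verify p(x1, x2, x3) = p(x1) p(x2|x1) p(x3|x1, x2) on a hypothetical table.
p123 = np.random.default_rng(0).dirichlet(np.ones(8)).reshape(2, 2, 2)

p1 = p123.sum(axis=(1, 2))            # p(x1)
p12 = p123.sum(axis=2)                # p(x1, x2)
p2_given_1 = p12 / p1[:, None]        # p(x2 | x1)
p3_given_12 = p123 / p12[:, :, None]  # p(x3 | x1, x2)

rebuilt = p1[:, None, None] * p2_given_1[:, :, None] * p3_given_12
print(np.allclose(rebuilt, p123))     # True
```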
- [1-1] If $\{ X_{t} \}$ is a Markov chain, then the following holds.
$$ p (x_{1}, \dots, x_{n}) = p (x_{1}) p (x_{2} | x_{1}) p (x_{3} | x_{2}) \cdots p (x_{n} | x_{n-1}) = p (x_{1}) \prod_{t=2}^{n} p (x_{t} | x_{t-1}) $$
Moreover, the following holds.
$$ p (x_{2}, \dots, x_{n} | x_{1}) = \prod_{t=2}^{n} p (x_{t} | x_{t-1}) $$
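Under the Markov factorization, a path probability is just the initial probability times a product of one-step transition probabilities. A minimal sketch with a hypothetical two-state chain:

```python
import numpy as np

# Hypothetical 2-state Markov chain: initial distribution and transition matrix.
pi0 = np.array([0.3, 0.7])        # p(x1)
P = np.array([[0.9, 0.1],         # P[i, j] = p(x_t = j | x_{t-1} = i)
              [0.4, 0.6]])

def path_prob(path):
    """p(x1, ..., xn) = p(x1) * prod_{t=2}^{n} p(x_t | x_{t-1})."""
    prob = pi0[path[0]]
    for prev, cur in zip(path, path[1:]):
        prob *= P[prev, cur]
    return prob

print(path_prob([0, 0, 1, 1]))    # 0.3 * 0.9 * 0.1 * 0.6 = 0.0162
```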
[2]
$$ p (x_{1}, x_{2} | x_{3}, x_{4}) = p (x_{1} | x_{2}, x_{3}, x_{4}) \, p (x_{2} | x_{3}, x_{4}) $$
or
$$ p (x_{1} | x_{2}, x_{3}, x_{4}) = \frac{p (x_{1}, x_{2} | x_{3}, x_{4})}{p (x_{2} | x_{3}, x_{4})} $$
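A numerical check of [2] on a hypothetical four-variable joint pmf (array axes index $x_1, x_2, x_3, x_4$):

```python
import numpy as np

# Hypothetical joint pmf p(x1, x2, x3, x4) as a 2x2x2x2 array.
p = np.random.default_rng(1).dirichlet(np.ones(16)).reshape(2, 2, 2, 2)

p234 = p.sum(axis=0)             # p(x2, x3, x4)
p34 = p234.sum(axis=0)           # p(x3, x4)

lhs = p / p34                    # p(x1, x2 | x3, x4)
rhs = (p / p234) * (p234 / p34)  # p(x1 | x2, x3, x4) * p(x2 | x3, x4)
print(np.allclose(lhs, rhs))     # True
```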
[3] Conditional Variance:
$$ \operatorname{Var} (X_{2} | X_{1} = x_{1}) = E \left[ \left( X_{2} - E (X_{2} | X_{1} = x_{1}) \right)^{2} \mid X_{1} = x_{1} \right] = E (X_{2}^{2} | X_{1} = x_{1}) - \left[ E (X_{2} | X_{1} = x_{1}) \right]^{2} $$
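A small numerical example of [3], using the shortcut form on a made-up joint table of $(X_1, X_2)$:

```python
import numpy as np

# Hypothetical joint pmf of (X1, X2) with X1 in {0, 1}, X2 in {0, 1, 2}.
p12 = np.array([[0.10, 0.20, 0.10],
                [0.15, 0.25, 0.20]])
vals = np.array([0, 1, 2])                         # support of X2

p2_given_1 = p12 / p12.sum(axis=1, keepdims=True)  # p(x2 | x1)
E = (vals * p2_given_1).sum(axis=1)                # E(X2 | X1 = x1)
E2 = (vals**2 * p2_given_1).sum(axis=1)            # E(X2^2 | X1 = x1)
print(E2 - E**2)                                   # Var(X2 | X1 = x1)
```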
[4]: $E \left[ E (X_{2} | X_{1}) \right] = E (X_{2})$
[5]: If $\operatorname{Var} (X_{2})$ exists, then $\operatorname{Var} \left[ E (X_{2} | X_{1}) \right] \le \operatorname{Var} (X_{2})$
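Both [4] (the law of iterated expectation) and [5] can be checked numerically on the same hypothetical table as in the previous sketch:

```python
import numpy as np

# Same hypothetical joint pmf of (X1, X2) as in the previous sketch.
p12 = np.array([[0.10, 0.20, 0.10],
                [0.15, 0.25, 0.20]])
vals = np.array([0, 1, 2])

p1 = p12.sum(axis=1)                             # p(x1)
p2 = p12.sum(axis=0)                             # p(x2)
cond_E = (vals * p12 / p1[:, None]).sum(axis=1)  # E(X2 | X1 = x1)

# [4]: E[E(X2 | X1)] should equal E(X2); both print 1.05 here.
print((cond_E * p1).sum(), (vals * p2).sum())

# [5]: Var[E(X2 | X1)] <= Var(X2)
var_cond = (cond_E**2 * p1).sum() - (cond_E * p1).sum() ** 2
var_x2 = (vals**2 * p2).sum() - (vals * p2).sum() ** 2
print(var_cond <= var_x2)                        # True
```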
Explanation
Conditional probability and conditional expected value are among the most computation-heavy topics in mathematical statistics, just as they are at the curriculum level, and the calculations only multiply in the multivariate case. Of course, the concept of conditioning is worth the complexity. Mathematical statistics leans primarily on calculus, but once the theory is redeveloped as measure-theoretic probability, many of these calculations become considerably cleaner. The key takeaway: don't dismiss the computations, but don't obsess over them either.
The generalized definitions can be obtained by repeatedly applying the bivariate definitions.
Proof
[1]
Moving the denominator across in the definition of the conditional probability mass function gives the following:
$$ p (x_{1}, \dots, x_{n-1}) p (x_{n} | x_{1}, \dots, x_{n-1}) = p (x_{1}, \dots, x_{n}) $$
$$ \implies p (x_{1}, \dots, x_{n}) = p (x_{1}, \dots, x_{n-1}) p (x_{n} | x_{1}, \dots, x_{n-1}) $$
Reapplying the same identity to $p (x_{1}, \dots, x_{n-1})$ gives:
$$ p (x_{1}, \dots, x_{n}) = \left[ p (x_{1}, \dots, x_{n-2}) p (x_{n-1} | x_{1}, \dots, x_{n-2}) \right] p (x_{n} | x_{1}, \dots, x_{n-1}) $$
Reapplying it to $p (x_{1}, \dots, x_{n-2})$ results in:
$$ p (x_{1}, \dots, x_{n}) = \left[ p (x_{1}, \dots, x_{n-3}) p (x_{n-2} | x_{1}, \dots, x_{n-3}) \right] p (x_{n-1} | x_{1}, \dots, x_{n-2}) p (x_{n} | x_{1}, \dots, x_{n-1}) $$
Continuing this repetition eventually gives:
$$ p (x_{1}, \dots, x_{n}) = p (x_{1}) p (x_{2} | x_{1}) p (x_{3} | x_{2}, x_{1}) \cdots p (x_{n} | x_{1}, \dots, x_{n-1}) $$
If $\{ X_{t} \}$ is a Markov chain, then $p (x_{n} | x_{n-1}, \dots, x_{1}) = p (x_{n} | x_{n-1})$ holds, so the following expression is obtained:
$$ p (x_{1}, \dots, x_{n}) = p (x_{1}) p (x_{2} | x_{1}) p (x_{3} | x_{2}) \cdots p (x_{n} | x_{n-1}) = p (x_{1}) \prod_{t=2}^{n} p (x_{t} | x_{t-1}) $$
Since $p (x_{1}, \dots, x_{n}) = p (x_{1}) p (x_{2}, \dots, x_{n} | x_{1})$ by the definition of the joint conditional probability mass function, comparing with the expression above yields:
$$ p (x_{1}, \dots, x_{n}) = p (x_{1}) \prod_{t=2}^{n} p (x_{t} | x_{t-1}) = p (x_{1}) p (x_{2}, \dots, x_{n} | x_{1}) $$
$$ \implies p (x_{2}, \dots, x_{n} | x_{1}) = \prod_{t=2}^{n} p (x_{t} | x_{t-1}) $$
Also, applying the definition of conditional probability with the indices in reverse order yields:
$$ \begin{aligned} p (x_{1}, \dots, x_{n}) &= p (x_{2}, \dots, x_{n}) p (x_{1} | x_{2}, \dots, x_{n}) \\ &= p (x_{3}, \dots, x_{n}) p (x_{2} | x_{3}, \dots, x_{n}) p (x_{1} | x_{2}, \dots, x_{n}) \\ &\ \ \vdots \\ &= p (x_{n}) p (x_{n-1} | x_{n}) p (x_{n-2} | x_{n-1}, x_{n}) \cdots p (x_{1} | x_{2}, \dots, x_{n}) \end{aligned} $$
■
[2]
$$ \begin{aligned} p (x_{1}, x_{2} | x_{3}, x_{4}) &= \frac{p (x_{1}, x_{2}, x_{3}, x_{4})}{p (x_{3}, x_{4})} \\ &= \frac{p (x_{1}, x_{2}, x_{3}, x_{4})}{p (x_{3}, x_{4})} \cdot \frac{p (x_{2}, x_{3}, x_{4})}{p (x_{2}, x_{3}, x_{4})} \\ &= \frac{p (x_{1}, x_{2}, x_{3}, x_{4})}{p (x_{2}, x_{3}, x_{4})} \cdot \frac{p (x_{2}, x_{3}, x_{4})}{p (x_{3}, x_{4})} \\ &= p (x_{1} | x_{2}, x_{3}, x_{4}) \, p (x_{2} | x_{3}, x_{4}) \end{aligned} $$
■