Categorical Distribution
Definition
Given a sample space with $k (\ge 2)$ categories, $\Omega = \left\{ 1, 2, \dots, k \right\}$, and a probability vector $\mathbf{p} = (p_{1}, \dots, p_{k})$, the discrete probability distribution with the following probability mass function is called the Categorical distribution.
$$ p(x = i) = p_{i}, \qquad i \in \left\{ 1, 2, \dots, k \right\} $$
Description
The probability of each of the $k$ categories occurring is represented by $\mathbf{p} = (p_{1}, \dots, p_{k})$. Therefore, $\mathbf{p}$ must satisfy the following condition.
$$ \sum_{i=1}^{k} p_{i} = 1, \qquad p_{i} \ge 0 $$
If the Bernoulli distribution is compared to “flipping a coin once,” the Categorical distribution can be compared to “rolling a die once.”
$\Omega = \Big\{$ $ \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FrbQJE%2FbtsMXyblpbo%2FZMTO1PeHbafLH3g97P0q41%2Fimg.png}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FxAbA3%2FbtsMW8KMCtK%2FodmS8gakkTAp7dP2Lk6JO0%2Fimg.png}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcgXAjB%2FbtsMYNLR5E3%2FIED729aUwdNa093xix0sz1%2Fimg.png}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdITnZE%2FbtsMX1qB6Eo%2FLh1bDQ0SkBl4k0PVHtjZDK%2Fimg.png}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FnR6Ka%2FbtsMW7kMTqC%2FM4VQF9U2wgbCfcRKfE1KdK%2Fimg.png}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fcsx5aP%2FbtsMXPKIgHh%2FbnXAmN8iHjFOwzhyjtcdTK%2Fimg.png} $ $\Big\}$
$$ \mathbf{p} = \left( \dfrac{1}{6}, \dfrac{1}{6}, \dfrac{1}{6}, \dfrac{1}{6}, \dfrac{1}{6}, \dfrac{1}{6} \right) $$
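The die example can be sketched in a few lines of Python. This is only an illustration using the standard library: the helper names `is_probability_vector` and `sample_categorical` are hypothetical, and the fixed seed is there purely so the sketch is reproducible.

```python
import math
import random


def is_probability_vector(p, tol=1e-9):
    """Return True if every p_i >= 0 and the entries sum to 1 (up to tol)."""
    return all(p_i >= 0 for p_i in p) and math.isclose(sum(p), 1.0, abs_tol=tol)


def sample_categorical(p, rng):
    """Draw one category from {1, ..., k} with P(x = i) = p[i - 1]."""
    if not is_probability_vector(p):
        raise ValueError("p must be non-negative and sum to 1")
    categories = list(range(1, len(p) + 1))
    # random.choices draws with replacement according to the given weights.
    return rng.choices(categories, weights=p, k=1)[0]


fair_die = [1 / 6] * 6
rng = random.Random(0)  # seeded only for reproducibility of this sketch
rolls = [sample_categorical(fair_die, rng) for _ in range(10)]
print(rolls)
```

Each entry of `rolls` is one realization of $\operatorname{Cat}(6; \mathbf{p})$ with the uniform $\mathbf{p}$ above. Passing a different probability vector models a loaded die.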
The following notation is used.
$$ \operatorname{Cat}(k; p_{1}, \dots, p_{k}) = \operatorname{Cat}(k; \mathbf{p}) $$
The Categorical distribution generalizes the Bernoulli distribution from $2$ categories to $k$ categories. Generalizing further from a single trial to $n$ trials leads to the Multinomial distribution.
| Categories / Trials | $1$ trial | $n$ trials |
|---|---|---|
| $2$ categories | Bernoulli distribution | Binomial distribution |
| $k$ categories | Categorical distribution | Multinomial distribution |
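The step from one trial to $n$ trials can be illustrated by repeating independent Categorical draws and counting how often each category appears. The resulting count vector is one realization of the Multinomial distribution. The following is a rough sketch under that reading; `categorical_draw` and `multinomial_counts` are hypothetical helper names, and categories are indexed from $0$ in the code for convenience.

```python
import random


def categorical_draw(p, rng):
    """Draw one category index in {0, ..., k-1} with probabilities p."""
    return rng.choices(list(range(len(p))), weights=p, k=1)[0]


def multinomial_counts(n, p, rng):
    """Count the outcomes of n independent Categorical draws."""
    counts = [0] * len(p)
    for _ in range(n):
        counts[categorical_draw(p, rng)] += 1
    return counts


rng = random.Random(0)  # seeded only for reproducibility of this sketch
p = [0.2, 0.5, 0.3]
counts = multinomial_counts(100, p, rng)
print(counts)  # one count per category, summing to 100
```

With $k = 2$ the same construction counts successes over $n$ Bernoulli trials, which corresponds to the Binomial cell of the table.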
The probability mass function can also be expressed as follows.
$$ p(j) = \prod\limits_{i=1}^{k} p_{i}^{\delta_{ji}} = \sum\limits_{i=1}^{k} \delta_{ji} p_{i}, \qquad j \in \left\{ 1, 2, \dots, k \right\} $$
$\delta_{ji}$ refers to the Kronecker delta.
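Both forms reduce to $p_{j}$, because every factor with $i \ne j$ is $p_{i}^{0} = 1$ and every term with $i \ne j$ is $0$. A small numerical check, offered as a sketch with hypothetical function names and an arbitrary example vector, makes this concrete:

```python
import math


def kronecker_delta(j, i):
    """Return 1 if j == i, otherwise 0."""
    return 1 if j == i else 0


def pmf_product(j, p):
    """Product form: prod_i p_i ** delta_ji, with categories indexed from 1."""
    return math.prod(p_i ** kronecker_delta(j, i) for i, p_i in enumerate(p, start=1))


def pmf_sum(j, p):
    """Sum form: sum_i delta_ji * p_i, with categories indexed from 1."""
    return sum(kronecker_delta(j, i) * p_i for i, p_i in enumerate(p, start=1))


p = [0.2, 0.5, 0.3]  # arbitrary example probability vector
values = [(pmf_product(j, p), pmf_sum(j, p)) for j in range(1, len(p) + 1)]
print(values)
```

Note that the product form relies on the convention $0^{0} = 1$ when some $p_{i} = 0$. Python follows the same convention for `0.0 ** 0`.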
Alternatively, the sample space can be identified with the standard basis of the Euclidean space $\mathbb{R}^{k}$, so that each realization is a one-hot vector. In this case, for a random vector $\mathbf{x} = (x_{1}, \dots, x_{k})$ subject to the constraint and probability mass function below, the Categorical distribution is written $\operatorname{Cat}(\mathbf{x}; \mathbf{p})$.
$$ x_{i} \in \left\{ 0, 1 \right\}, \qquad \sum_{i=1}^{k} x_{i} = 1 $$
$$ p(\mathbf{x}) = p(x_{1}, \dots, x_{k}) = \prod\limits_{i=1}^{k} p_{i}^{x_{i}} $$
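The one-hot form can be evaluated directly. The sketch below assumes the vector is stored as a plain Python list; `one_hot`, `is_one_hot`, and `pmf_one_hot` are hypothetical names introduced only for illustration.

```python
import math


def one_hot(j, k):
    """Return the j-th standard basis vector of length k (j counted from 1)."""
    return [1 if i == j else 0 for i in range(1, k + 1)]


def is_one_hot(x):
    """Check x_i in {0, 1} for all i and sum_i x_i == 1."""
    return all(x_i in (0, 1) for x_i in x) and sum(x) == 1


def pmf_one_hot(x, p):
    """Evaluate prod_i p_i ** x_i for a one-hot vector x."""
    if len(x) != len(p) or not is_one_hot(x):
        raise ValueError("x must be a one-hot vector of the same length as p")
    return math.prod(p_i ** x_i for p_i, x_i in zip(p, x))


p = [0.2, 0.5, 0.3]  # arbitrary example probability vector
x = one_hot(2, len(p))
print(x, pmf_one_hot(x, p))
```

Since exactly one exponent equals $1$ and the rest equal $0$, the product picks out the single $p_{i}$ whose position matches the nonzero entry of $\mathbf{x}$.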