What Is One-Hot Encoding in Machine Learning?
Definition
Let $X_{1}, \dots, X_{N}$ be subsets of a given set $X \subset \mathbb{R}^{n}$ that satisfy the following.
$$ X = X_{1} \cup \cdots \cup X_{N} \quad \text{and} \quad X_{i} \cap X_{j} = \varnothing \enspace (i \ne j) $$
Let $\beta = \left\{ e_{1}, \dots, e_{N} \right\}$ be the standard basis of $\mathbb{R}^{N}$. Then the following function, or the act of mapping $x \in X$ in this way, is called one-hot encoding.
$$ \begin{align*} f : X &\to \beta \\ x &\mapsto e_{i} \text{ if } x \in X_{i} \end{align*} $$
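The definition can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the helper names `standard_basis_vector`, `make_one_hot`, and `interval_index` are hypothetical, and the partition of $X = [0, 3)$ into three unit intervals is an invented example, not something taken from the text.

```python
from typing import Callable, List


def standard_basis_vector(i: int, n: int) -> List[int]:
    """Return e_i in R^n, with i counted from 1 as in the definition."""
    if not 1 <= i <= n:
        raise ValueError(f"index {i} is outside 1..{n}")
    return [1 if k == i else 0 for k in range(1, n + 1)]


def make_one_hot(class_index: Callable[[float], int], n: int) -> Callable[[float], List[int]]:
    """Build f: X -> beta from a rule that tells which X_i contains x."""
    def f(x: float) -> List[int]:
        return standard_basis_vector(class_index(x), n)
    return f


# Assumed partition (hypothetical): X_1 = [0, 1), X_2 = [1, 2), X_3 = [2, 3).
def interval_index(x: float) -> int:
    return int(x) + 1


f = make_one_hot(interval_index, 3)
print(f(0.5))  # [1, 0, 0]
print(f(2.7))  # [0, 0, 1]
```

Because the $X_{i}$ are pairwise disjoint and cover $X$, every $x$ has exactly one index $i$, which is what makes `f` well defined.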
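Explanation
This is a common way to label data in machine learning. It is called one-hot ("only one is lit") because exactly one component is nonzero. The reason for mapping labels this way is to treat them as qualitative (categorical) variables rather than quantitative ones. Suppose a photo of clothes is labeled $[1]$ and a photo of shoes is labeled $[2]$. Nothing about the two photos is related by a factor of $2$, yet the labels express exactly that relation. Moreover, if the prediction is $[5]$, it is unclear whether to say it is at least closer to $[2]$ than to $[1]$ or to say the prediction simply failed. Using labels such as $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$ and $\begin{bmatrix} 0 \\ 1 \end{bmatrix}$ instead avoids attaching unintended meaning and keeps the values within the intended range. Here $N = \left| \beta \right|$ is the number of classes into which the data are to be classified.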
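One way to see the unintended meaning carried by integer labels is to compare pairwise distances. The sketch below is only illustrative: the third class `"hat"` and the helper names are assumptions added for the example. Integer labels place some classes farther apart than others, while one-hot labels keep every pair of classes at the same distance $\sqrt{2}$.

```python
import math
from itertools import combinations


def distance(u, v):
    """Euclidean distance between two label vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_distances(labels):
    """Distance between every pair of class labels."""
    return {(a, b): distance(labels[a], labels[b]) for a, b in combinations(labels, 2)}


# "hat" is a hypothetical third class added so that there is more than one pair.
integer_labels = {"clothes": [1], "shoes": [2], "hat": [3]}
one_hot_labels = {"clothes": [1, 0, 0], "shoes": [0, 1, 0], "hat": [0, 0, 1]}

int_dist = pairwise_distances(integer_labels)
hot_dist = pairwise_distances(one_hot_labels)

for pair in int_dist:
    print(pair, int_dist[pair], round(hot_dist[pair], 4))
```

With integer labels, clothes and hat end up twice as far apart as clothes and shoes, although no such ordering was intended. With one-hot labels no pair is privileged.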
For example, one-hot encoding the MNIST data means the following.
$ \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdgvulP%2FbtrRTtjz8Ah%2FIKWA7Ckzkjitj5X6vwd11k%2Fimg.jpg}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FchjNBz%2FbtrRW0nrB59%2FwUVzGwFGvVIA9iemnOmkN1%2Fimg.jpg}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbRAv7N%2FbtrRWLjvZku%2FCLGtZLlkuC7fKZlSZlr2u1%2Fimg.jpg}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F9YZyq%2FbtrRSPtGAii%2F2N3tRn9bhQhLbs0l0OKxT0%2Fimg.jpg}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcJpInQ%2FbtrRWZaZ4Bo%2FwE0wwSOxZZ7wrwKqCFQbA1%2Fimg.jpg}, \raisebox{0.5em}{$\enspace \cdots \enspace \mapsto e_{1} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0\end{bmatrix}^{T}$} $
$ \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FrV2Hd%2FbtrRXuocv2o%2FEP2Tt3R7Vft3dPucw5iJz1%2Fimg.jpg}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FVQWMs%2FbtrRXfLytuV%2FxvEuEznI71CnPBD0fNEHmk%2Fimg.jpg}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbbvAq2%2FbtrRTtYkr1S%2FA45KGWUNxA2IT2mqeBVqWK%2Fimg.jpg}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Ftf6ng%2FbtrRXvm3jcc%2FzQouozMFozW7Eiq3Dsqqe0%2Fimg.jpg}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FT2gLG%2FbtrRYyJ8alW%2FIqmmahDUmM1yXhAXmg2MWK%2Fimg.jpg}, \raisebox{0.5em}{$\enspace \cdots \enspace \mapsto e_{2} = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0\end{bmatrix}^{T}$} $
$ \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FppAxy%2FbtrRTtxgxbr%2F4cfRUjLAzD5TzsDopAkKt0%2Fimg.jpg}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FwwRei%2FbtrRVK6oKTc%2FISAO9LE6Qc4j5KglwxV0K0%2Fimg.jpg}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FboTCrt%2FbtrRX2EGFT7%2F4SkN8ZDSHTS57Nf2CpIiz1%2Fimg.jpg}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FxxZzL%2FbtrRVLjYEDk%2F5eQyGDM6bNjq4KNrmPltb1%2Fimg.jpg}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FxxZzL%2FbtrRVLjYEDk%2F5eQyGDM6bNjq4KNrmPltb1%2Fimg.jpg}, \raisebox{0.5em}{$\enspace \cdots \enspace \mapsto e_{3} = \begin{bmatrix} 0 & 0 & 1 & \cdots & 0\end{bmatrix}^{T}$} $
$$\vdots$$
$ \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbRSmNg%2FbtrRTtxgz8s%2FjpZ5TGHy9d6JKjTob92PA0%2Fimg.jpg}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbcVpNY%2FbtrRXDE9s9S%2Fka5hNQVMgXgn8kyPD5ZBG0%2Fimg.jpg}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fc7gcV8%2FbtrRX1lvvbZ%2FeSuCvSRoHs3scKOvfer3n1%2Fimg.jpg}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FNWuc9%2FbtrRX1MyDYL%2F4c0G8AJknZoDGe9zdwuBVk%2Fimg.jpg}, \includegraphics[height=2em]{https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FmG3XY%2FbtrRXGhClhU%2FsDgIVjw4Kq4KWl5PPcXyyK%2Fimg.jpg}, \raisebox{0.5em}{$\enspace \cdots \enspace \mapsto e_{10} = \begin{bmatrix} 0 & 0 & 0 & \cdots & 1\end{bmatrix}^{T}$} $
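The mapping above could be sketched for digit labels as follows. This assumes, as the ordering of the rows suggests but the text does not state explicitly, that the digit $d \in \{0, \dots, 9\}$ is sent to $e_{d+1}$. The names `encode_digit` and `decode` are hypothetical, and no actual MNIST data is loaded here.

```python
NUM_CLASSES = 10  # N = |beta| for the ten digit classes


def encode_digit(d: int) -> list:
    """Map digit d to e_{d+1} in R^10 (assumed convention: 0 -> e_1, ..., 9 -> e_10)."""
    if not 0 <= d < NUM_CLASSES:
        raise ValueError(f"{d} is not a digit label")
    vec = [0] * NUM_CLASSES
    vec[d] = 1
    return vec


def decode(vec) -> int:
    """Recover the digit as the position of the largest component."""
    return max(range(len(vec)), key=vec.__getitem__)


labels = [0, 1, 2, 9]
encoded = [encode_digit(d) for d in labels]

for d, vec in zip(labels, encoded):
    print(d, vec)

# decode also works on a model's score vector, not only on exact one-hot vectors.
scores = [0.01, 0.02, 0.85, 0.02, 0.02, 0.02, 0.02, 0.02, 0.01, 0.01]
print(decode(scores))  # 2
```

Decoding by the largest component is a common companion to one-hot labels, since a classifier's output is usually a vector of scores rather than an exact basis vector.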