

What is Reinforcement Learning in Machine Learning

Definition

Reinforcement learning is the process where an agent interacts with the environment to find a policy that maximizes the cumulative reward.

Description¹

The elements comprising reinforcement learning are as follows:

  • Agent: Decides actions based on a policy, given a state.
  • State: Refers to the situation in which the agent is placed.
  • Action: Refers to the choices available to the agent in a given state.
  • Policy: Refers to the strategy according to which the agent decides its actions in a given state.
  • Reward: Refers to the score the agent receives based on the actions it chooses in a given state. It can be considered the goal the agent aims to achieve.
  • Environment: When the agent takes an action in a given state, the environment determines, according to the MDP (Markov Decision Process), the next state and the corresponding reward.
  • Episode: Refers to the period from when the interaction between the agent and the environment begins to when it ends.

These elements can be illustrated by analogy in various scenarios as follows:

| Reinforcement Learning | Studying for Exams | Go Game |
| --- | --- | --- |
| Agent | Student | Go Player |
| State | Days Left Until Exam | Go Board |
| Action | Study, Drinking, Gaming, etc. | Placing a Stone |
| Policy | Study Plan per Day | Strategy |
| Reward | Exam Score | Win/Loss |
| Episode | Exam Period | One Game |

Problem of Reinforcement Learning: Grid Model

A typical example used to explain reinforcement learning is the grid world. Consider a robot that can move one square at a time in four directions (up, down, left, and right) on the following $4 \times 4$ grid. The starting square is randomly chosen from $\boxed{\ 2\ }$ to $\boxed{15}$, and the goal is to reach $\fcolorbox{black}{darkgray}{\ 1\ }$ or $\fcolorbox{black}{darkgray}{16}$ in as few moves as possible.
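To make the grid concrete, here is a minimal sketch in Python of how the squares and moves could be represented; the function name `next_square` and the convention that an off-grid move leaves the robot in place are my own assumptions, not part of the text.

```python
# Minimal sketch of the 4x4 grid world (names and the "stay in place when
# moving off the grid" convention are assumptions, not from the text).
ACTIONS = ["up", "right", "down", "left"]
TERMINAL = {1, 16}                              # the two dark goal squares

def next_square(square: int, action: str) -> int:
    """Move one square in the given direction on the 4x4 grid (squares 1..16)."""
    row, col = divmod(square - 1, 4)            # 0-based row and column
    drow, dcol = {"up": (-1, 0), "right": (0, 1), "down": (1, 0), "left": (0, -1)}[action]
    row = min(max(row + drow, 0), 3)            # clamp to the grid
    col = min(max(col + dcol, 0), 3)
    return 4 * row + col + 1

print(next_square(7, "up"))                     # 3: moving up from square 7 reaches square 3
```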

Agent

In reinforcement learning, the agent is the subject that learns, although it does not exist as a concrete object. Unlike the concepts defined below, which are random variables, the agent has no precise mathematical definition. The theory of reinforcement learning can therefore be developed without the concept of an agent, and indeed this is often the case; in the theory, it is essentially the policy that represents the agent. For convenience, however, it is common to treat the learning subject as an entity and say things like 'the agent acts' or 'the agent's state has changed'. The agent is merely what appears to be learning in computer simulations (especially games). For instance, the movement of an agent in the grid model can be expressed simply as a sequence of states.

$$ \boxed{\ 3\ } \to \boxed{\ 2\ } \to \fcolorbox{black}{darkgray}{\ 1\ } $$

One only needs to print the sequence $3, 2, 1$. Ultimately, what we want to obtain from reinforcement learning is essentially a policy, so learning can proceed even without defining an agent. In short, the agent can be regarded as the visualization (embodiment) of a policy.

Of course, the above is true in theory and computer simulations. In actual applications such as autonomous driving, there is a need for drones or cars that move according to policies. In this case, robots or machines like drones and cars become agents, and without them, learning the policy would be impossible.

State

[Figure: the $4 \times 4$ grid world with squares numbered $1$ to $16$]

The state is a random variable, denoted by $S$. Since the episode progresses sequentially over time, we use an index $t$ for time, and the state at time $t$ is denoted $S_{t}$. The initial time is usually taken to be $t = 0$. In summary, $S_{t}$ is a function that assigns a value to each of the squares at time $t$.

$$ S_{t} \left( \boxed{\ n\ } \right) = n,\quad 1\le n \le 16 $$

In this case, the set of all possible state values (the function values of the state function) is denoted as $\mathcal{S}\subset \mathbb{R}$, and its elements are denoted as $s$.

$$ \mathcal{S} = \left\{ s_{1}, s_{2},\dots \right\} $$

Therefore, the state function for the above grid model is as follows.

$$ S_{t} : \left\{ \fcolorbox{black}{darkgray}{\ 1\ } , \boxed{\ 2\ }, \dots, \boxed{15}, \fcolorbox{black}{darkgray}{16} \right\} \to \mathcal{S} \\ S_{t} \left( \boxed{\ n\ } \right) = s_{n} = n,\quad 1\le n \le 16 $$

Then, the probability that the state changes to $s_{10}$ at the next time step, given that the state at time $t$ is $s_{6}$, is written as follows.

$$ P \left( S_{t+1} = s_{10} | S_{t} = s_{6} \right) $$

A state where the episode ends once reached is called a terminal state. In the above grid model, the terminal states are $\fcolorbox{black}{darkgray}{1}, \fcolorbox{black}{darkgray}{16}$.
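As a small illustration, the sketch below (my own, not from the text) enumerates the state values, marks the terminal states, and checks that exactly one of the four moves takes square $6$ to square $10$, so that the conditional probability above equals $1/4$ when each move is chosen with equal probability.

```python
# Sketch: state set, terminal states, and P(S_{t+1} = s_10 | S_t = s_6) under
# equally likely moves (off-grid moves are assumed to leave the agent in place).
from fractions import Fraction

STATES = set(range(1, 17))                      # identify s_n with the square number n
TERMINAL = {1, 16}

def next_square(square, action):
    row, col = divmod(square - 1, 4)
    drow, dcol = {"up": (-1, 0), "right": (0, 1), "down": (1, 0), "left": (0, -1)}[action]
    return 4 * min(max(row + drow, 0), 3) + min(max(col + dcol, 0), 3) + 1

moves_into_10 = sum(next_square(6, a) == 10 for a in ["up", "right", "down", "left"])
print(Fraction(moves_into_10, 4))               # 1/4: only "down" takes square 6 to square 10
```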

Action

The action refers to the choices the agent can make in the current state, and it is also a random variable. Denoted as $A_{t}$, it represents the action at time $t$. In the grid model example above, one can choose up, down, left, or right in each of the squares from $\boxed{2}$ to $\boxed{15}$. The set of all possible action values (the function values of the action function) is denoted as $\mathcal{A}\subset \mathbb{R}$, and its elements are denoted as $a$.

$$ \mathcal{A} = \left\{ a_{1}, a_{2}, \dots \right\} $$

Then, the action function at time $t$ is as follows.

$$ A_{t} : \left\{ \uparrow, \rightarrow, \downarrow, \leftarrow \right\} \to \mathcal{A} \\ \begin{cases} A_{t}(\uparrow) = a_{1} \\ A_{t}(\rightarrow) = a_{2} \\ A_{t}(\downarrow) = a_{3} \\ A_{t}(\leftarrow) = a_{4} \end{cases} $$

The agent decides its actions based on probabilities given the current state. For example, the probability of choosing action $a_1$ in state $s_6$ at time $t$ is as follows.

$$ P(A_{t} = a_{1} | S_{t} = s_{6}) $$
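In code, choosing an action this way amounts to sampling from the conditional distribution $P(A_{t} = a \mid S_{t} = s)$. The sketch below is my own illustration; the probabilities are placeholders.

```python
# Sketch: sampling an action from P(A_t = a | S_t = s_6); the probabilities
# below are placeholders, not values prescribed by the text.
import random

actions = ["a1", "a2", "a3", "a4"]              # up, right, down, left
p_given_s6 = [0.25, 0.25, 0.25, 0.25]           # P(A_t = a_i | S_t = s_6)

chosen = random.choices(actions, weights=p_given_s6, k=1)[0]
print(chosen)                                   # e.g. "a3"
```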

Policy

The policy specifies the probability of deciding action $a$ in state $s$ for all $s$ and $a$, denoted as $\pi$. It’s analogous to a strategy in a game or war. In the grid model example, if the probability of deciding each action is $\dfrac{1}{4}$, the policy $\pi$ is as follows.

$$ \pi \begin{cases} P(a_{1} | s_{2}) = \dfrac{1}{4} \\ P(a_{2} | s_{2}) = \dfrac{1}{4} \\ P(a_{3} | s_{2}) = \dfrac{1}{4} \\ \vdots \\ P(a_{2} | s_{15}) = \dfrac{1}{4} \\ P(a_{3} | s_{15}) = \dfrac{1}{4} \\ P(a_{4} | s_{15}) = \dfrac{1}{4} \end{cases} \quad \text{or} \quad \pi : \mathcal{S} \times \mathcal{A} \to [0,1] $$
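A minimal way to represent this uniform policy in code is as a lookup table over state-action pairs, as sketched below (the dictionary representation is my own choice, not from the text).

```python
# Sketch: the uniform policy pi as a table pi[(s, a)] = P(a | s) over the
# non-terminal squares 2..15 (the representation is an assumption of mine).
ACTIONS = ["a1", "a2", "a3", "a4"]              # up, right, down, left
NONTERMINAL = range(2, 16)

pi = {(s, a): 1 / 4 for s in NONTERMINAL for a in ACTIONS}

print(pi[(2, "a1")])                            # 0.25 = P(a_1 | s_2)
# Each state's action probabilities sum to 1, as a policy requires.
assert all(abs(sum(pi[(s, a)] for a in ACTIONS) - 1.0) < 1e-12 for s in NONTERMINAL)
```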

Of course, this is not an optimized policy. Considering just the square $\boxed{2}$, it is better to assign no probability to moving up, since that move would take the agent off the grid. Therefore, the policy $\pi_{2}$ in the illustration below can be said to be better than the uniform policy $\pi_{1}$.

[Figure: the uniform policy $\pi_{1}$ and an improved policy $\pi_{2}$]

The goal of reinforcement learning algorithms is to find the optimal policy. The question then arises, how do we find the optimal policy? It can be found through the value function, which evaluates the quality of a policy.

Reward

The reward is a function that assigns a real number to the action chosen by the agent in a given state, and is denoted $R_{t}$. The set of all reward values (the function values of the reward function) is denoted as $\mathcal{R} \subset \mathbb{R}$, and its elements are denoted as $r$.

$$ \mathcal{R} = \left\{ r_{1}, r_{2}, \dots \right\} \\ R_{t} : \mathcal{S} \times \mathcal{A} \to \mathcal{R} $$

A reward is received once at each time step, and the ultimate goal of reinforcement learning is to find a policy that maximizes the cumulative reward, the total reward received in an episode.

One might wonder why we focus on maximizing cumulative rewards rather than rewards at each time step. This can be easily understood through the analogy of studying for exams. During the exam period, spending every evening drinking and partying or playing games might be more enjoyable than studying. However, the cumulative reward, i.e., the exam score, would be terrible. Hence, even if studying is tiring and challenging in the moment, it’s considered better for the sake of future greater rewards.

The reward is a hyperparameter set by the designer and should be chosen appropriately for the task the agent must perform. For instance, in the grid model example, if the grid is a maze and the agent is a robot escaping the maze, one can set a reward of $-1$ for moving one square and $+10$ for reaching a terminal state. If the grid is a park and the agent is a robot walking a pet, one can set a reward of $0$ for moving one square and $+10$ for reaching a terminal state.
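The two reward choices above can be written as simple functions of the square the agent lands on; this is only a sketch of one possible encoding, where taking the next square as the argument is my simplification of $R_{t} : \mathcal{S} \times \mathcal{A} \to \mathcal{R}$.

```python
# Sketch of the two reward designs mentioned above (maze vs. park); passing the
# next square instead of (state, action) is a simplifying assumption of mine.
TERMINAL = {1, 16}

def maze_reward(next_square: int) -> float:
    """Maze: -1 for each move, +10 on reaching a terminal square."""
    return 10.0 if next_square in TERMINAL else -1.0

def park_reward(next_square: int) -> float:
    """Park: 0 for each move, +10 on reaching a terminal square."""
    return 10.0 if next_square in TERMINAL else 0.0

print(maze_reward(3), maze_reward(1))           # -1.0 10.0
```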

Environment

The environment is a function that decides the next state and reward based on the action chosen by the agent in a given state, i.e., $f : (s,a) \mapsto (s^{\prime},r)$. Because it is just such a function, it is difficult to find a perfect real-world analogy for it.

Let’s say the state at time $t$ is $s_{t}$, and the action chosen at $s_{t}$ is $a_{t}$. Then the next state decided by the environment is $s_{t+1}$, and the reward is $r_{t+1}$. This can be represented as follows.

$$ f(s_{t}, a_{t}) = (s_{t+1}, r_{t+1}) $$

If in the grid model example, the agent chose $\uparrow$ at $\boxed{7}$ and the environment decided the next state to be $\boxed{3}$ and the reward to be $-1$, it can be expressed with the following formula.

$$ f(s_{7}, a_{1}) = (s_{3}, -1) $$
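A sketch of the environment as such a function $f$ is given below, using the maze-style reward ($-1$ per move, $+10$ at a terminal square); the clamping of off-grid moves is my assumption, and the printed result matches the example $f(s_{7}, a_{1}) = (s_{3}, -1)$ above.

```python
# Sketch: the environment as a function f(s, a) -> (s', r) with the maze-style
# reward; off-grid moves are assumed to leave the agent where it is.
TERMINAL = {1, 16}
DELTAS = {"a1": (-1, 0), "a2": (0, 1), "a3": (1, 0), "a4": (0, -1)}   # up, right, down, left

def f(square: int, action: str):
    row, col = divmod(square - 1, 4)
    drow, dcol = DELTAS[action]
    row, col = min(max(row + drow, 0), 3), min(max(col + dcol, 0), 3)
    nxt = 4 * row + col + 1
    return nxt, (10.0 if nxt in TERMINAL else -1.0)

print(f(7, "a1"))                               # (3, -1.0), i.e. f(s_7, a_1) = (s_3, -1)
```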

Just as the rule by which the agent decides its actions is called the policy, the rule by which the environment decides the next state and reward is called the MDP (Markov decision process). The interaction between the agent and the environment can be illustrated as follows.

[Figure: the interaction between the agent and the environment]

Episode

The sequence of states, actions, and rewards determined by the interaction between the agent and the environment is called a trajectory or history. When the trajectory is finite, the problem is referred to as an episodic task. The exam period, the Go game, and the grid model examples mentioned earlier fall into this category.

$$ s_{0}, a_{0}, s_{1}, r_{1}, a_{1}, s_{2}, r_{2}, \dots, a_{T-1}, s_{T}, r_{T} \\ \text{or} \\ (s_{0},) \overset{a_{0}}{\to} (s_{1}, r_{1}) \overset{a_{1}}{\to} (s_{2}, r_{2}) \overset{a_{2}}{\to} \cdots \overset{a_{T-1}}{\to} (s_{T}, r_{T}) $$

When the trajectory is infinite, the problem is referred to as a continuing task. However, very long episodes are sometimes treated as if they were infinite.

$$ s_{0}, a_{0}, s_{1}, r_{1}, a_{1}, s_{2}, r_{2}, \dots, a_{t-1}, s_{t}, r_{t}, a_{t}, s_{t+1}, r_{t+1},\dots \\ \text{or} \\ (s_{0},) \overset{a_{0}}{\to} (s_{1}, r_{1}) \overset{a_{1}}{\to} (s_{2}, r_{2}) \overset{a_{2}}{\to} \cdots \overset{a_{t-1}}{\to} (s_{t}, r_{t}) \overset{a_{t}}{\to} (s_{t+1}, r_{t+1}) \overset{a_{t+1}}{\to} \cdots $$
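Putting the pieces together, the sketch below (my own) rolls out one episode of the grid world under the uniform random policy and records the trajectory $s_{0}, a_{0}, s_{1}, r_{1}, \dots$ until a terminal state is reached.

```python
# Sketch: generating one episode's trajectory under the uniform random policy
# (grid conventions as in the earlier sketches; all names are assumptions).
import random

TERMINAL = {1, 16}
DELTAS = {"up": (-1, 0), "right": (0, 1), "down": (1, 0), "left": (0, -1)}

def f(square, action):                           # environment: (s, a) -> (s', r)
    row, col = divmod(square - 1, 4)
    drow, dcol = DELTAS[action]
    row, col = min(max(row + drow, 0), 3), min(max(col + dcol, 0), 3)
    nxt = 4 * row + col + 1
    return nxt, (10.0 if nxt in TERMINAL else -1.0)

s = random.randint(2, 15)                        # starting square chosen at random
trajectory = [s]                                 # s_0
while s not in TERMINAL:
    a = random.choice(list(DELTAS))              # uniform random policy
    s, r = f(s, a)
    trajectory += [a, s, r]                      # a_t, s_{t+1}, r_{t+1}

print(trajectory)                                # e.g. [3, 'left', 2, -1.0, 'left', 1, 10.0]
```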


  1. O Il-Seok, Machine Learning (MACHINE LEARNING). 2017, pp. 466–480.