Figure 6. The agent-environment interaction of the basic reinforcement learning model.
environment and the reward due to the previous action. The reward represents an assessment of the action taken by the agent.
More formally, we assume that there is a series of time steps $t = 0, 1, 2, \ldots$ in a basic RL model. At a certain time step $t$, the agent receives a state signal $S_t$ from the environment. In each step, the agent selects one of the actions allowed by that state and takes an action $A_t$. After the environment receives the action signal $A_t$, it feeds back to the agent the corresponding state signal $S_{t+1}$ at the next step $t+1$ and the immediate reward $R_{t+1}$. The set of all possible states, i.e., the state space, is denoted as $\mathcal{S}$. Similarly, the action space is denoted as $\mathcal{A}$. Since our goal is to maximize the total reward, we can quantify this total reward, usually referred to as the return, with
$G_t = R_{t+1} + R_{t+2} + \cdots + R_T$,
where $T$ is the last step, i.e., $S_T$ is the termination state. An episode is completed when the agent reaches the termination state.
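To make this interaction loop concrete, the following sketch simulates one episodic task and accumulates the undiscounted return. The ToyEnv environment, its fixed horizon, and the random policy are hypothetical, chosen only to illustrate how $S_t$, $A_t$, $R_{t+1}$, and $G_t$ fit together; they are not part of the original formulation.

import random

class ToyEnv:
    """A trivial episodic environment: the episode terminates after a fixed horizon."""
    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # initial state S_0

    def step(self, action):
        self.t += 1
        next_state = self.t                # state signal S_{t+1}
        reward = random.random()           # immediate reward R_{t+1}
        done = self.t >= self.horizon      # termination state reached?
        return next_state, reward, done

def run_episode(env, policy):
    # Interact until the termination state and accumulate the return
    # G_0 = R_1 + R_2 + ... + R_T.
    state = env.reset()
    G, done = 0.0, False
    while not done:
        action = policy(state)                  # select an allowed action A_t
        state, reward, done = env.step(action)  # receive S_{t+1} and R_{t+1}
        G += reward
    return G

print(run_episode(ToyEnv(), policy=lambda s: random.choice([0, 1])))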
In addition to this type of episodic task, there is another type of task that does not have a termination state; in other words, it can in principle run forever. This type of task is called a continuing task. For continuing tasks, since there is no termination state, the above definition of return may diverge. Thus, another way to calculate the return is introduced, called the discounted return, i.e.,
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$,
where the discount factor $\gamma$ satisfies $0 \leqslant \gamma \leqslant 1$. When $\gamma = 1$, the agent obtains the full value of all future steps, while when $\gamma = 0$, the agent can only see the current reward. As $\gamma$ increases from 0 to 1, the agent gradually becomes more forward-looking, attending not only to the current reward but also to its future rewards.
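As a small numerical illustration of the discount factor, the sketch below computes the discounted return for a hypothetical constant reward sequence; the function name and the example rewards are assumptions made for illustration only.

def discounted_return(rewards, gamma):
    # G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence R_{t+1}, R_{t+2}, ...
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.0))  # 1.0: only the current reward is visible
print(discounted_return(rewards, gamma=0.9))  # ~4.10: later rewards contribute progressively less
print(discounted_return(rewards, gamma=1.0))  # 5.0: the full value of all future steps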
The value function is the agent's prediction of future rewards, which is used to evaluate the quality of a state and to select actions. The difference between the value function and the reward is that the latter evaluates the immediate outcome of an interaction, while the former is defined as the average return of actions over a long period of time. In other words, the value function of the current state $S_t = s$ is its long-term expected return. There are two significant value functions in the field of RL, i.e., the state value function $V_\pi(s)$ and the action value function $Q_\pi(s, a)$. The function $V_\pi(s)$ represents the expected return obtained if the agent keeps following the policy $\pi$ after reaching a certain state $s$, while the function $Q_\pi(s, a)$ represents the expected return obtained if the action $A_t = a$ is taken after reaching the current state $S_t = s$ and the subsequent actions are selected according to the policy $\pi$. The two functions are specifically defined as follows, i.e.,