                                Figure 6. The agent-environment interaction of the basic reinforcement learning model.


environment and the reward due to the previous action. The reward represents an assessment of the action taken by the agent.
More formally, we assume that there is a series of time steps $t = 0, 1, 2, \dots$ in a basic RL model. At a certain time step $t$, the agent receives a state signal $S_t$ from the environment. In each step, the agent selects one of the actions allowed by that state and takes an action $A_t$. After the environment receives the action signal $A_t$, it feeds back to the agent the corresponding state signal $S_{t+1}$ at the next step $t+1$ and the immediate reward $R_{t+1}$. The set of all possible states, i.e., the state space, is denoted as $\mathcal{S}$. Similarly, the action space is denoted as $\mathcal{A}$. Since our goal is to maximize the total reward, we can quantify this total reward, usually referred to as the return, with

$$G_t = R_{t+1} + R_{t+2} + \dots + R_T,$$

where $T$ is the last step, i.e., $S_T$ is the termination state. An episode is completed when the agent reaches the termination state.
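As a rough illustration of this interaction loop and the undiscounted return, the following Python sketch rolls out a single episode and accumulates $G_0 = R_1 + R_2 + \dots + R_T$; the env.reset()/env.step() interface and the random policy are hypothetical placeholders used only for this example, not part of the original formulation.

import random

def run_episode(env, policy):
    """Roll out one episode and accumulate the undiscounted return
    G_0 = R_1 + R_2 + ... + R_T (assumes a hypothetical env with reset()/step())."""
    state = env.reset()                          # initial state S_0
    G, done = 0.0, False
    while not done:
        action = policy(state)                   # agent selects an allowed action A_t
        state, reward, done = env.step(action)   # environment returns S_{t+1} and R_{t+1}
        G += reward                              # accumulate the return
    return G

def random_policy(state, actions=(0, 1)):
    """A trivial policy that ignores the state and picks a random action from A."""
    return random.choice(actions)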


In addition to this type of episodic task, there is another type of task that does not have a termination state; in other words, it can in principle run forever. This type of task is called a continuing task. For continuing tasks, since there is no termination state, the above definition of return may diverge. Thus, another way to calculate the return is introduced, called the discounted return, i.e.,

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$$
where the discount factor $\gamma$ satisfies $0 \leqslant \gamma \leqslant 1$. When $\gamma = 1$, the agent takes the full value of all future steps into account, while when $\gamma = 0$, the agent only sees the current reward. As $\gamma$ increases from 0 to 1, the agent gradually becomes more far-sighted, weighing not only the immediate reward but also its long-term future.
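To make the effect of the discount factor concrete, the short Python sketch below computes the discounted return by folding an illustrative (made-up) reward sequence backwards through the recursion $G_t = R_{t+1} + \gamma G_{t+1}$:

def discounted_return(rewards, gamma):
    """Compute G_0 = sum_k gamma^k * R_{k+1} by backward recursion
    G_t = R_{t+1} + gamma * G_{t+1}."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

rewards = [1.0, 0.0, 2.0]                      # illustrative rewards R_1, R_2, R_3
print(discounted_return(rewards, gamma=1.0))   # 3.0: all future steps valued fully
print(discounted_return(rewards, gamma=0.0))   # 1.0: only the immediate reward counts
print(discounted_return(rewards, gamma=0.9))   # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62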

The value function is the agent's prediction of future rewards, which is used to evaluate the quality of a state and to select actions. The difference between the value function and the reward is that the latter evaluates the immediate outcome of an interaction, while the former is the average return of actions over a long period of time. In other words, the value function of the current state $S_t = s$ is its long-term expected return. There are two significant value functions in the field of RL, i.e., the state value function $v_\pi(s)$ and the action value function $q_\pi(s, a)$. The function $v_\pi(s)$ represents the expected return obtained if the agent follows strategy $\pi$ all the time after reaching a certain state $s$, while the function $q_\pi(s, a)$ represents the expected return obtained if action $A_t = a$ is taken after reaching the current state $S_t = s$ and the following actions are taken according to the strategy $\pi$. The two functions are specifically defined as follows, i.e.,