Figure 7. The categories and representative algorithms of reinforcement learning.




V_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right], \quad \forall s \in \mathcal{S}
Q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right], \quad \forall s \in \mathcal{S}, a \in \mathcal{A}.
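For concreteness, the short sketch below estimates V_π(s) by Monte Carlo averaging of sampled returns. It is an illustrative sketch rather than the paper's method, and it assumes, as is standard, that G_t denotes the discounted return; the reward sequences are made up.

```python
import numpy as np

# Illustrative sketch, not from the paper: assuming G_t is the discounted
# return R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..., V_pi(s) is the
# expected return and can be estimated by averaging the returns of sampled
# episodes that start in state s and follow policy pi.
gamma = 0.9
episodes_from_s = [            # made-up reward sequences sampled under pi
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0, 3.0],
]

def discounted_return(rewards, gamma):
    return sum(r * gamma ** k for k, r in enumerate(rewards))

v_estimate = np.mean([discounted_return(r, gamma) for r in episodes_from_s])
print(v_estimate)              # Monte Carlo estimate of V_pi(s)
```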


The result of RL is a set of action decisions, called the policy. The policy tells the agent which action a should be taken in each state s. It is denoted as π(a|s) = P(A_t = a | S_t = s), which represents the probability of taking action A_t = a in state S_t = s. The goal of RL is to learn the optimal policy that maximizes the value function by interacting with the environment. The purpose is not to obtain the maximum reward from a single action in the short term, but to accumulate more reward in the long term. Therefore, the optimal policy can be written as

\pi_* = \arg\max_\pi V_\pi(s), \quad \forall s \in \mathcal{S}.
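Once an (approximately) optimal action-value table has been learned, the policy that maximizes the value function is read off by acting greedily with respect to it. The minimal sketch below is not from the paper; the Q table is a random stand-in used only for illustration.

```python
import numpy as np

# Minimal sketch, not from the paper: acting greedily with respect to a
# learned action-value table Q[s, a] yields the value-maximizing policy.
n_states, n_actions = 4, 2
rng = np.random.default_rng(0)
Q = rng.random((n_states, n_actions))   # stand-in for a learned Q table

greedy_policy = Q.argmax(axis=1)        # pi(s) = argmax_a Q(s, a)
print(greedy_policy)                    # one action index per state
```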

               3.2. Categories of reinforcement learning
In RL, algorithms fall into several categories: value-based methods and policy-based methods. In addition, there are actor-critic algorithms, which combine the two, as shown in Figure 7.


               3.2.1. Value-based methods
By recursively expanding the formula of the action value function, the corresponding Bellman equation is obtained, which describes the recursive relationship between the value function of the current state and that of the subsequent state. The recursive expansion of the action value function Q_π(s, a) is

Q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \sum_{a'} \pi(a' \mid s') Q_\pi(s', a') \right],

where the function p(s', r | s, a) = Pr{S_t = s', R_t = r | S_{t-1} = s, A_{t-1} = a} defines the trajectory probability that characterizes the environment's dynamics. R_t = r indicates the reward obtained by the agent after taking action A_{t-1} = a in state S_{t-1} = s. Besides, S_t = s' and A_t = a' respectively represent the state and the action taken by the agent at the next moment t.
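To make this backup concrete, the toy sketch below iterates the Bellman expectation backup for Q_π on a small MDP. The dynamics P and expected rewards R are made-up assumptions chosen only so the code runs; the reward is folded into its expectation rather than summed over explicitly as in the equation above.

```python
import numpy as np

# Toy sketch of the Bellman expectation backup for Q_pi on a 2-state,
# 2-action MDP with assumed dynamics (illustrative only).
n_s, n_a, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.9, 0.1]]])   # P[s, a, s'], rows sum to 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # expected reward for (s, a)
pi = np.full((n_s, n_a), 0.5)              # a fixed, uniformly random policy

Q = np.zeros((n_s, n_a))
for _ in range(100):                       # iterate the backup to convergence
    # Q(s,a) <- sum_{s'} P(s'|s,a) [ R(s,a) + gamma * sum_{a'} pi(a'|s') Q(s',a') ]
    v_next = (pi * Q).sum(axis=1)          # expected value of each next state
    Q = R + gamma * (P @ v_next)
print(Q)                                   # fixed point approximates Q_pi
```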
In value-based algorithms, the above value function Q_π(s, a) is calculated iteratively, and the policy is then improved based on this value function. If the value of every action in a given state is known, the agent can select an action to perform. In this way, if the optimal action value Q_*(s, a) can be figured out, the best action a_* can be found. There are many classical value-based algorithms, including Q-learning [27], state–action–reward–state–action (SARSA) [28], etc.
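As a concrete illustration of the value-based family, the sketch below gives tabular Q-learning (discussed next). It is a minimal sketch, not the authors' implementation; the environment is hypothetical and assumed to expose a Gym-style reset/step interface.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1):
    """Minimal tabular Q-learning sketch (illustrative, not the authors' code).

    `env` is a hypothetical environment assumed to expose a Gym-style
    interface: reset() -> state, step(action) -> (next_state, reward, done).
    """
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng()
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection for exploration
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # model-free update: bootstrap from the greedy value of the next state
            target = r + gamma * Q[s_next].max() * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```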

Q-learning is a typical, widely used value-based RL algorithm. It is also a model-free algorithm, which means that it does not need to know a model of the environment but directly estimates the Q value of each executed