

               action in each encountered state through interacting with the environment [27] . Then, the optimal strategy is
               formulated by selecting the action with the highest Q value in each state. This strategy maximizes the expected
return for all subsequent actions from the current state. The most important part of Q-learning is the update of the Q value. Q-learning uses a table, the Q-table, to store all Q values, with states as rows and actions as columns. Each (s, a) pair corresponds to a Q value, Q(s, a), in the Q-table, which is updated as follows:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

where $r$ is the reward obtained by taking action $a$ in state $s$ at the current time step, $s'$ and $a'$ denote the state and the action taken by the agent at the next time step, respectively, $\alpha$ is the learning rate that determines how much of the error is learned, and $\gamma$ is the discount factor that attenuates future rewards. If the agent continuously accesses
               all state-action pairs, the Q-learning algorithm will converge to the optimal Q function. Q-learning is suitable
               for simple problems, i.e., small state space, or a small number of actions. It has high data utilization and stable
               convergence.
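As a concrete illustration of the tabular update above, the following Python sketch stores the Q-table as a NumPy array and applies the update rule with an ε-greedy behavior policy; the state/action counts and hyperparameter values are placeholders, not taken from the paper.

```python
import numpy as np

# Placeholder sizes and hyperparameters (illustrative values, not from the paper).
n_states, n_actions = 16, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount factor, exploration rate

# Q-table: states as rows, actions as columns.
Q = np.zeros((n_states, n_actions))

def select_action(s):
    """Epsilon-greedy action selection over the current Q-table."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(Q[s].argmax())

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```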

               3.2.2. Policy-based methods
               The above value-based method is an indirect approach to policy selection, and has trouble handling an infinite
               number of actions. Therefore, we want to be able to model the policy directly. Different from the value-based
               method, the policy-based algorithm does not need to estimate the value function, but directly fits the policy
               function, updates the policy parameters through training, and directly generates the best policy. In policy-
based methods, we input a state and directly output the corresponding action, rather than the state value $V(s)$ or the Q value $Q(s, a)$. One of the most representative algorithms is the policy gradient, which is also the most basic policy-based algorithm.

               Policy gradient chooses to optimize the policy directly and update the parameters of the policy network by
calculating the gradient of the expected reward [29]. Therefore, its objective function $J(\theta)$ is directly defined as the expected cumulative reward, i.e.,

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[ r(\tau) \right] = \int \pi_\theta(\tau)\, r(\tau)\, d\tau,$$

where $\tau$ denotes a trajectory sampled from the policy $\pi_\theta$.

By taking the derivative of $J(\theta)$ and using the identity $\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)$, we get

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right].$$
The above formula consists of two parts. One is $\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, which is the gradient of the log-probability of the actions along the current trajectory. The other is $\sum_{t=1}^{T} r(s_t, a_t)$, which is the return of the current trajectory. Since the return is the total reward and can only be obtained after one episode is completed, the policy gradient algorithm can only be updated once per episode, not at each time step.
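To make the two factors concrete, the sketch below computes the per-episode gradient estimate for an assumed linear-softmax policy over discrete actions (this parameterization is an illustrative choice, not specified in the paper): the sum of $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ over the trajectory is scaled by the trajectory's return.

```python
import numpy as np

def softmax_policy(theta, s):
    """Action probabilities pi_theta(a|s) of a linear-softmax policy; s is a feature vector."""
    logits = theta.T @ s                      # shape: (n_actions,)
    p = np.exp(logits - logits.max())         # subtract max for numerical stability
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a|s) with respect to theta (shape: n_features x n_actions)."""
    p = softmax_policy(theta, s)
    grad = -np.outer(s, p)                    # -s * pi(a'|s) for every action a'
    grad[:, a] += s                           # +s for the action actually taken
    return grad

def episode_gradient(theta, states, actions, rewards):
    """Policy-gradient estimate for one trajectory: (sum of grad-log-probs) * (total return)."""
    total_return = sum(rewards)
    g = sum(grad_log_pi(theta, s, a) for s, a in zip(states, actions))
    return g * total_return
```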


The expected value can be expressed in a variety of ways, corresponding to different ways of calculating the loss function. The advantage of the policy gradient algorithm is that it can be applied in continuous action spaces. In addition, the change in action probabilities is smoother, and convergence is better guaranteed.


The REINFORCE algorithm is a classic policy gradient algorithm [30]. Since the expected value of the cumulative reward cannot be computed directly, the Monte Carlo method is applied to approximate it by averaging over multiple sampled trajectories. REINFORCE therefore updates the policy with an unbiased estimate of the gradient obtained by Monte Carlo sampling.