Each sampling pass generates a trajectory, and this process runs iteratively. After a large number of trajectories has been collected, the cumulative reward can be calculated, with certain transformations and approximations, and used as the loss function for the gradient update. However, the variance of this algorithm is large, since the agent must interact with the environment until the terminal state is reached. The reward of each interaction is a random variable, so the individual variances add up when the variance of the return is computed. In particular, the REINFORCE algorithm has three steps (a code sketch follows the list):

• Step 1: sample trajectories $\tau^i$ from $\pi_\theta(a_t \mid s_t)$
• Step 2: $\nabla_\theta J(\theta) \approx \sum_{i=1}^{N} \left[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\left(a_t^i \mid s_t^i\right) \left( \sum_{t=1}^{T} r\left(s_t^i, a_t^i\right) \right) \right]$
• Step 3: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
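As an illustration of these three steps, the following is a minimal PyTorch sketch of a REINFORCE update. The environment interface (`env.reset()` returning a state and `env.step()` returning a state, reward, and done flag), the network sizes, and the learning rate are illustrative assumptions rather than details from the original algorithm description.

```python
import torch
import torch.nn as nn

# Illustrative policy network pi_theta(a_t | s_t); the 4-dimensional state,
# 2 discrete actions, and layer width are assumptions for this sketch.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(env, num_trajectories=10):
    """One REINFORCE update: sample trajectories, build the surrogate loss, take a gradient step."""
    loss = torch.tensor(0.0)
    for _ in range(num_trajectories):                       # Step 1: sample tau^i from pi_theta
        state, log_probs, rewards, done = env.reset(), [], [], False
        while not done:
            logits = policy(torch.as_tensor(state, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            state, reward, done = env.step(action.item())   # assumed environment interface
            rewards.append(reward)
        # Step 2: sum_t grad log pi(a_t|s_t), weighted by the total trajectory reward
        loss = loss - torch.stack(log_probs).sum() * sum(rewards)
    loss = loss / num_trajectories
    optimizer.zero_grad()
    loss.backward()        # gradient of the surrogate loss is a -grad_theta J(theta) estimate
    optimizer.step()       # Step 3: theta <- theta + alpha * grad_theta J(theta)
```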
Both the value-based and policy-based methods have their own characteristics and disadvantages. Firstly, the disadvantages of the value-based methods are that the action cannot be output directly and that they are difficult to extend to continuous action spaces. The value-based methods can also lead to the problem of high bias, i.e., it is difficult to eliminate the error between the estimated value function and the actual value function. For the policy-based methods, a large number of trajectories must be sampled, and the difference between individual trajectories may be huge. As a result, high variance and large gradient noise are introduced, which leads to unstable training and difficulty of policy convergence.


               3.2.3. Actor-critic methods
The actor-critic architecture combines the characteristics of the value-based and policy-based algorithms and, to a certain extent, overcomes their respective weaknesses as well as the tension between high variance and high bias. The constructed agent can not only output policies directly but also evaluate the performance of the current policy through the value function. Specifically, the actor-critic architecture consists of an actor, which is responsible for generating the policy, and a critic, which evaluates that policy. While the actor is acting, the critic evaluates its performance, and both are updated continuously [31]. This complementary training is generally more effective than a purely policy-based or value-based method.

Specifically, the input of the actor is the state $s_t$ and its output is the action $a_t$; the role of the actor is to approximate the policy model $\pi_\theta(a_t \mid s_t)$. The critic outputs the value function $Q$ to evaluate the policy, and this Q value $Q(s_t, a_t)$ can be applied directly to calculate the loss function of the actor. The gradient of the expected return function $J(\theta)$ in the actor-critic framework is developed from the basic policy gradient algorithm. The policy gradient algorithm can only perform one update per episode and has difficulty feeding back the reward accurately, so its training efficiency is poor. Instead, the actor-critic algorithm replaces $\sum_{t=1}^{T} r(s_t, a_t)$ with $Q(s_t, a_t)$ to evaluate the expected return of the state-action tuple $\{s_t, a_t\}$ at the current time step $t$. The gradient of $J(\theta)$ can be expressed as
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, Q(s_t, a_t) \right].$$
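As a rough illustration of how the critic's $Q(s_t, a_t)$ replaces the sampled return in the gradient above, the following PyTorch-style sketch performs one actor-critic update per time step. The network shapes, the TD target used to fit the critic, and the learning rates are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

# Actor approximates pi_theta(a_t | s_t); critic approximates Q(s_t, a_t).
# The 4-dimensional state, 2 discrete actions, layer widths, and learning
# rates below are assumptions made only for this sketch.
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
critic = nn.Sequential(nn.Linear(4 + 1, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def actor_critic_step(s, a, r, s_next, a_next, gamma=0.99):
    """One per-step update: the critic is fitted to a TD target, and the actor
    ascends grad log pi_theta(a_t|s_t) * Q(s_t, a_t), i.e. the gradient above."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    # Critic update: minimise the squared TD error (r + gamma*Q(s', a') - Q(s, a))^2.
    q_sa = critic(torch.cat([s, torch.tensor([float(a)])]))
    with torch.no_grad():
        td_target = r + gamma * critic(torch.cat([s_next, torch.tensor([float(a_next)])]))
    critic_loss = (td_target - q_sa).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: policy gradient term weighted by the critic's Q value.
    log_pi = torch.distributions.Categorical(logits=actor(s)).log_prob(torch.tensor(a))
    actor_loss = -(log_pi * q_sa.detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```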
               3.3. Deep reinforcement learning
With the continuous expansion of the applications of deep learning, its wave also swept into the RL field, resulting in deep reinforcement learning (DRL), i.e., using a multi-layer deep neural network to approximate the value function or policy function in the RL algorithm [32,33]. DRL mainly solves the curse-of-dimensionality problem in real-world RL applications with large or continuous state and/or action spaces, where the traditional tabular RL algorithms cannot store and extract a large amount of feature information [17,34].
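For instance, instead of indexing a value table by discrete state-action pairs, a DRL agent feeds a (possibly continuous, high-dimensional) state vector into a neural network that outputs the corresponding values. The sketch below is a generic illustration of such a function approximator; the state dimension, number of actions, and layer width are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# A tabular method stores one entry per (state, action) pair, e.g.
# q_table[state_index][action_index], which is infeasible for large or
# continuous spaces. A deep network replaces the table with a parametric
# function that maps a state vector to a value for every action.
class QNetwork(nn.Module):
    def __init__(self, state_dim, num_actions):
        super().__init__()
        # The hidden-layer width is an arbitrary illustrative choice.
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, state):
        return self.net(state)  # one value per action

# Example: a 4-dimensional continuous state and 2 discrete actions (assumed).
q_net = QNetwork(state_dim=4, num_actions=2)
values = q_net(torch.randn(1, 4))      # shape (1, 2)
greedy_action = values.argmax(dim=1)   # greedy action selection
```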


Q-learning, as a very classical algorithm in RL, is a good example for understanding the purpose of DRL. The major issue with Q-learning lies in its tabular nature, which means that when the state and action spaces are very large, it cannot build a Q table large enough to store all the Q values [35]. Besides, it counts and iterates