Each sampling pass generates a trajectory, and this process runs iteratively. After a large number of trajectories has been collected, the cumulative reward can be calculated, with certain transformations and approximations, and used as the loss function for the gradient update. However, the variance of this algorithm is large, since the agent must interact with the environment until the terminal state is reached. The reward of each interaction is a random variable, so the individual variances add up when the variance of the return is computed. In particular, the REINFORCE algorithm has three steps (a code sketch follows the list):

• Step 1: sample trajectories $\tau^i$ from $\pi_\theta(a_t \mid s_t)$
• Step 2: $\nabla_\theta J(\theta) \approx \sum_{i=1}^{N} \left[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\left(a_t^i \mid s_t^i\right) \left( \sum_{t=1}^{T} r\left(s_t^i, a_t^i\right) \right) \right]$
• Step 3: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
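As an illustration of these three steps, the following is a minimal PyTorch sketch of a REINFORCE update. The environment interface (`env.reset()` returning a state and `env.step()` returning a state, reward, and done flag), the network sizes, and the learning rate are illustrative assumptions rather than details from the original algorithm description.

```python
import torch
import torch.nn as nn

# Illustrative policy network pi_theta(a_t | s_t); the 4-dimensional state,
# 2 discrete actions, and layer width are assumptions for this sketch.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(env, num_trajectories=10):
    """One REINFORCE update: sample trajectories, build the surrogate loss, take a gradient step."""
    loss = torch.tensor(0.0)
    for _ in range(num_trajectories):                       # Step 1: sample tau^i from pi_theta
        state, log_probs, rewards, done = env.reset(), [], [], False
        while not done:
            logits = policy(torch.as_tensor(state, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            state, reward, done = env.step(action.item())   # assumed environment interface
            rewards.append(reward)
        # Step 2: sum_t grad log pi(a_t|s_t), weighted by the total trajectory reward
        loss = loss - torch.stack(log_probs).sum() * sum(rewards)
    loss = loss / num_trajectories
    optimizer.zero_grad()
    loss.backward()        # gradient of the surrogate loss is a -grad_theta J(theta) estimate
    optimizer.step()       # Step 3: theta <- theta + alpha * grad_theta J(theta)
```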
Both the value-based and policy-based methods have their own characteristics and disadvantages. Firstly, the disadvantages of the value-based methods are that the action cannot be output directly and that they are difficult to extend to continuous action spaces. The value-based methods can also lead to the problem of high bias, i.e., it is difficult to eliminate the error between the estimated value function and the actual value function. For the policy-based methods, a large number of trajectories must be sampled, and the difference between individual trajectories may be huge. As a result, high variance and large gradient noise are introduced, which leads to unstable training and difficulty of policy convergence.


               3.2.3. Actor-critic methods
The actor-critic architecture combines the characteristics of the value-based and policy-based algorithms and, to a certain extent, overcomes their respective weaknesses as well as the tension between high variance and high bias. The constructed agent can not only output policies directly but also evaluate the performance of the current policy through the value function. Specifically, the actor-critic architecture consists of an actor, which is responsible for generating the policy, and a critic, which evaluates that policy. While the actor is acting, the critic evaluates its performance, and both are updated continuously [31]. This complementary training is generally more effective than a purely policy-based or value-based method.

Specifically, the input of the actor is the state $s_t$ and its output is the action $a_t$; the role of the actor is to approximate the policy model $\pi_\theta(a_t \mid s_t)$. The critic outputs the value function $Q$ to evaluate the policy, and this Q value $Q(s_t, a_t)$ can be applied directly to calculate the loss function of the actor. The gradient of the expected return function $J(\theta)$ in the actor-critic framework is developed from the basic policy gradient algorithm. The policy gradient algorithm can only perform one update per episode and has difficulty feeding back the reward accurately, so its training efficiency is poor. Instead, the actor-critic algorithm replaces $\sum_{t=1}^{T} r(s_t, a_t)$ with $Q(s_t, a_t)$ to evaluate the expected return of the state-action tuple $\{s_t, a_t\}$ at the current time step $t$. The gradient of $J(\theta)$ can be expressed as
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, Q(s_t, a_t) \right].$$
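As a rough illustration of how the critic's $Q(s_t, a_t)$ replaces the sampled return in the gradient above, the following PyTorch-style sketch performs one actor-critic update per time step. The network shapes, the TD target used to fit the critic, and the learning rates are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

# Actor approximates pi_theta(a_t | s_t); critic approximates Q(s_t, a_t).
# The 4-dimensional state, 2 discrete actions, layer widths, and learning
# rates below are assumptions made only for this sketch.
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
critic = nn.Sequential(nn.Linear(4 + 1, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def actor_critic_step(s, a, r, s_next, a_next, gamma=0.99):
    """One per-step update: the critic is fitted to a TD target, and the actor
    ascends grad log pi_theta(a_t|s_t) * Q(s_t, a_t), i.e. the gradient above."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    # Critic update: minimise the squared TD error (r + gamma*Q(s', a') - Q(s, a))^2.
    q_sa = critic(torch.cat([s, torch.tensor([float(a)])]))
    with torch.no_grad():
        td_target = r + gamma * critic(torch.cat([s_next, torch.tensor([float(a_next)])]))
    critic_loss = (td_target - q_sa).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: policy gradient term weighted by the critic's Q value.
    log_pi = torch.distributions.Categorical(logits=actor(s)).log_prob(torch.tensor(a))
    actor_loss = -(log_pi * q_sa.detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```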
               3.3. Deep reinforcement learning
With the continuous expansion of the applications of deep learning, its wave also swept into the RL field, resulting in deep reinforcement learning (DRL), i.e., using a multi-layer deep neural network to approximate the value function or policy function in the RL algorithm [32,33]. DRL mainly solves the curse-of-dimensionality problem in real-world RL applications with large or continuous state and/or action spaces, where the traditional tabular RL algorithms cannot store and extract a large amount of feature information [17,34].
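For instance, instead of indexing a value table by discrete state-action pairs, a DRL agent feeds a (possibly continuous, high-dimensional) state vector into a neural network that outputs the corresponding values. The sketch below is a generic illustration of such a function approximator; the state dimension, number of actions, and layer width are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# A tabular method stores one entry per (state, action) pair, e.g.
# q_table[state_index][action_index], which is infeasible for large or
# continuous spaces. A deep network replaces the table with a parametric
# function that maps a state vector to a value for every action.
class QNetwork(nn.Module):
    def __init__(self, state_dim, num_actions):
        super().__init__()
        # The hidden-layer width is an arbitrary illustrative choice.
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, state):
        return self.net(state)  # one value per action

# Example: a 4-dimensional continuous state and 2 discrete actions (assumed).
q_net = QNetwork(state_dim=4, num_actions=2)
values = q_net(torch.randn(1, 4))      # shape (1, 2)
greedy_action = values.argmax(dim=1)   # greedy action selection
```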


Q-learning, as a very classical algorithm in RL, is a good example for understanding the purpose of DRL. The major issue with Q-learning lies in its tabular nature, which means that when the state and action spaces are very large, it cannot build a Q table large enough to store all the Q values [35]. Besides, it counts and iterates