action in each encountered state through interacting with the environment [27]. Then, the optimal strategy is
formulated by selecting the action with the highest Q value in each state. This strategy maximizes the expected
return for all subsequent actions from the current state. The most important part of Q-learning is the update
of the Q value. It uses a table, i.e., the Q-table, to store all Q value functions, with states as rows and actions as columns. Each $(s, a)$ pair corresponds to a Q value, i.e., $Q(s, a)$, in the Q-table, which is updated as follows:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right],$$
where $r$ is the reward received for taking action $a$ in state $s$ at the current time step, $s'$ and $a'$ indicate the state and the action taken by the agent at the next time step, respectively, $\alpha$ is the learning rate that determines how much of the error is learned, and $\gamma$ is the discount factor that attenuates future rewards. If the agent repeatedly visits
all state-action pairs, the Q-learning algorithm will converge to the optimal Q function. Q-learning is suitable
for simple problems, i.e., those with a small state space or a small number of actions. It has high data utilization and stable
convergence.
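As a concrete illustration, the following Python sketch implements the tabular update above; the environment interface (reset()/step()), the epsilon-greedy exploration scheme, and the parameter values are assumptions made for illustration and are not taken from the original text.

```python
import numpy as np

# Minimal tabular Q-learning sketch. The environment `env` is assumed to
# expose reset() -> state and step(action) -> (next_state, reward, done),
# with integer states and actions (a Gym-style convention, not from the paper).
def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))      # Q-table: rows = states, columns = actions
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```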
3.2.2. Policy-based methods
The above value-based method is an indirect approach to policy selection, and has trouble handling an infinite
number of actions. Therefore, we want to be able to model the policy directly. Different from the value-based
method, the policy-based algorithm does not need to estimate the value function, but directly fits the policy
function, updates the policy parameters through training, and directly generates the best policy. In policy-
based methods, we input a state and directly output the corresponding action, rather than the state value $V(s)$ or the Q value $Q(s, a)$. One of the most representative algorithms is policy gradient, which is also the
most basic policy-based algorithm.
Policy gradient optimizes the policy directly and updates the parameters of the policy network by calculating the gradient of the expected reward [29]. Therefore, its objective function $J(\theta)$ is directly defined as the expected cumulative reward, i.e.,
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[ r(\tau) \right] = \int \pi_\theta(\tau) \, r(\tau) \, d\tau .$$
By taking the derivative of $J(\theta)$, we get
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t=1}^{T} r(s_t, a_t) \right].$$
The above formula consists of two parts. One is $\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, which denotes the gradient of the log-probability of the current trajectory. The other is $\sum_{t=1}^{T} r(s_t, a_t)$, which represents the return of the current trajectory. Since the return is the total reward and can only be obtained after an episode is completed, the policy gradient algorithm can only be updated once per episode, not at every time step.
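To make this episode-level update concrete, the short sketch below computes the gradient estimate for a single sampled trajectory under a tabular softmax policy $\pi_\theta(a \mid s) = \mathrm{softmax}(\theta_s)_a$; the tabular parameterization and the helper names are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def softmax(x):
    z = x - x.max()                                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Per-episode policy-gradient estimate for a tabular softmax policy,
# where theta has shape [n_states, n_actions]; names are illustrative only.
def episode_gradient(theta, trajectory):
    """trajectory: list of (state, action, reward) tuples from one episode."""
    R = sum(r for _, _, r in trajectory)            # return of the whole trajectory
    grad = np.zeros_like(theta)
    for s, a, _ in trajectory:
        probs = softmax(theta[s])
        dlogpi = -probs                             # gradient of log pi(a|s) w.r.t. theta[s]
        dlogpi[a] += 1.0
        grad[s] += dlogpi * R                       # weight by the episode return
    return grad

# One gradient-ascent step, applied only after the episode has finished:
# theta += learning_rate * episode_gradient(theta, trajectory)
```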
The expected value can be expressed in a variety of ways, corresponding to different ways of calculating the loss function. The advantage of the policy gradient algorithm is that it can be applied to continuous action spaces. In addition, the action probabilities change more smoothly, and convergence is better guaranteed.
The REINFORCE algorithm is a classic policy gradient algorithm [30]. Since the expected value of the cumulative reward cannot be calculated directly, the Monte Carlo method is applied to approximate it by averaging over multiple samples. REINFORCE thus uses Monte Carlo sampling to obtain an unbiased estimate of the gradient for its updates.
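A minimal REINFORCE update might look like the sketch below, assuming `policy` is a small network mapping a state to action logits and `env` follows a Gym-style reset()/step() interface (both are assumptions for illustration); it samples one episode by Monte Carlo rollout and performs a single gradient step on the sampled log-probabilities weighted by the episode return.

```python
import torch
from torch.distributions import Categorical

# REINFORCE sketch: one Monte Carlo episode, one gradient step.
# `policy` (state -> action logits) and the Gym-style `env` are assumed here.
def reinforce_episode(env, policy, optimizer):
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(logits=logits)
        action = dist.sample()                      # Monte Carlo sampling from pi_theta
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(action.item())
        rewards.append(reward)

    episode_return = sum(rewards)                   # total reward, available only after the episode
    # Maximizing sum_t log pi(a_t|s_t) * R is equivalent to minimizing its negative.
    loss = -torch.stack(log_probs).sum() * episode_return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice, the total episode return is often replaced by discounted per-step returns to reduce the variance of this Monte Carlo estimate, but the simpler form above matches the formula given in the text.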