

Table 1. Taxonomy of representative algorithms for DRL.

Value-based: Deep Q-Network (DQN) [37], Double Deep Q-Network (DDQN) [39], DDQN with proportional prioritization [40]
Policy-based: REINFORCE [30], Q-Prop [41]
Actor-critic: Soft Actor-Critic (SAC) [42], Asynchronous Advantage Actor-Critic (A3C) [43], Deep Deterministic Policy Gradient (DDPG) [44], Distributed Distributional Deep Deterministic Policy Gradients (D4PG) [45], Twin Delayed Deep Deterministic Policy Gradient (TD3) [46], Trust Region Policy Optimization (TRPO) [47], Proximal Policy Optimization (PPO) [48]
Advanced (POMDP): Deep Belief Q-Network (DBQN) [49], Deep Recurrent Q-Network (DRQN) [50], Recurrent Deterministic Policy Gradients (RDPG) [51]
Advanced (multi-agent): Multi-Agent Importance Sampling (MAIS) [52], Coordinated Multi-agent DQN [53], Multi-agent Fingerprints (MAF) [52], Counterfactual Multi-agent Policy Gradient (COMAPG) [54], Multi-Agent DDPG (MADDPG) [55]


Q values based on past states. Therefore, on the one hand, the state and action spaces to which Q-learning is applicable are very small. On the other hand, if a state has never appeared, Q-learning cannot deal with it [36]. In other words, Q-learning has neither prediction ability nor generalization ability in this respect.
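
To make this limitation concrete, the sketch below shows a minimal tabular Q-learning update; the environment interface, action set, and hyperparameter values are illustrative assumptions rather than taken from the cited works. The Q-table only holds values for state-action pairs that have actually been visited, so an unseen state carries no informed estimate.

```python
from collections import defaultdict
import random

# Minimal tabular Q-learning sketch (illustrative assumptions only).
alpha, gamma, epsilon = 0.1, 0.99, 0.1
actions = [0, 1, 2, 3]                 # assumed discrete action set
Q = defaultdict(float)                 # Q[(state, action)] -> value; lookup table only

def select_action(state):
    if random.random() < epsilon:
        return random.choice(actions)  # explore
    # Exploit the best known action. States never visited all default to 0,
    # i.e., the table offers no prediction or generalization for them.
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    # Standard Q-learning target: r + gamma * max_a' Q(s', a')
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```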


To give Q-learning prediction ability, and considering that neural networks can extract feature information well, the deep Q-network (DQN) was proposed, which applies a deep neural network to approximate the Q-value function. Specifically, DQN extends the Q-learning algorithm to continuous or large state spaces by replacing the Q-table with a neural network that approximates the Q-value function [37].
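
As a rough illustration of this idea, the following sketch (PyTorch is assumed here; the layer sizes and state/action dimensions are placeholders, not values from the DQN paper) replaces the Q-table with a small network that maps a state vector to one Q value per action, so even previously unseen states receive a prediction.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates Q(s, a) for all actions at once: state in, Q-vector out."""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)

# Greedy action selection: unlike a Q-table, the network produces
# Q values even for states it has never encountered before.
q_net = QNetwork(state_dim=4, num_actions=2)   # assumed dimensions
state = torch.rand(1, 4)                        # placeholder state vector
action = q_net(state).argmax(dim=1).item()
```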


In addition to value-based DRL algorithms such as DQN, we summarize a variety of classical DRL algorithms by algorithm type in Table 1, referring to several DRL-related surveys [38]. The table covers not only policy-based and actor-critic DRL algorithms, but also advanced DRL algorithms for partially observable Markov decision processes (POMDPs) and multi-agent settings.




               4. FEDERATED REINFORCEMENT LEARNING
               In this section, the detailed background and categories of FRL will be discussed.

               4.1. Federated reinforcement learning background
Despite the excellent performance that RL and DRL have achieved in many areas, they still face several important technical and non-technical challenges in solving real-world problems. The successful application of FL to supervised learning tasks has aroused interest in exploiting similar ideas in RL, i.e., FRL. On the other hand, although FL is useful in some specific situations, it fails to deal with cooperative control and optimal decision-making in dynamic environments [10]. FRL not only provides the experience for agents to learn to make good decisions in an unknown environment, but also ensures that the data privately collected during an agent's exploration does not have to be shared with others. A forward-looking and interesting research direction is how to conduct RL under the premise of protecting privacy. Therefore, it has been proposed to use the FL framework to enhance the security of RL and to define FRL as a security-enhanced distributed RL framework that accelerates the learning process, protects agent privacy, and handles non-independent and identically distributed (non-IID) data [8]. Apart from improving the security and privacy of RL, we believe that FRL has broader potential to help RL achieve better performance in various aspects, which will be elaborated in the following subsections.
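
As a loose illustration of this idea (not the specific protocol of any cited work), the sketch below shows one communication round of federated averaging over locally trained Q-network weights: each agent updates its own copy of the model on private experience, and only model parameters, never the raw transitions, are uploaded. The function names and the agent/server interface are assumptions made for illustration.

```python
import copy
import torch

def local_update(global_net, private_experience, train_fn):
    """Agent side: train a private copy; the raw experience never leaves the agent."""
    local_net = copy.deepcopy(global_net)
    train_fn(local_net, private_experience)   # e.g., a few DQN gradient steps
    return local_net.state_dict()

def federated_average(state_dicts):
    """Server side: average parameters across agents (FedAvg-style aggregation)."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

# One round (illustrative):
#   1. every agent runs local_update() on its private trajectories
#   2. the server averages the uploaded weights with federated_average()
#   3. the aggregated model is broadcast back to all agents
# global_net.load_state_dict(federated_average(collected_state_dicts))
```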