

Table 1. Taxonomy of representative algorithms for DRL.

Value-based: Deep Q-Network (DQN) [37], Double Deep Q-Network (DDQN) [39], DDQN with proportional prioritization [40]
Policy-based: REINFORCE [30], Q-Prop [41]
Actor-critic: Soft Actor-Critic (SAC) [42], Asynchronous Advantage Actor-Critic (A3C) [43], Deep Deterministic Policy Gradient (DDPG) [44], Distributed Distributional Deep Deterministic Policy Gradients (D4PG) [45], Twin Delayed Deep Deterministic Policy Gradient (TD3) [46], Trust Region Policy Optimization (TRPO) [47], Proximal Policy Optimization (PPO) [48]
Advanced (POMDP): Deep Belief Q-Network (DBQN) [49], Deep Recurrent Q-Network (DRQN) [50], Recurrent Deterministic Policy Gradients (RDPG) [51]
Advanced (multi-agent): Multi-Agent Importance Sampling (MAIS) [52], Coordinated Multi-agent DQN [53], Multi-agent Fingerprints (MAF) [52], Counterfactual Multi-agent Policy Gradient (COMAPG) [54], Multi-Agent DDPG (MADDPG) [55]


Q values based on past states. Therefore, on the one hand, the state and action spaces to which Q-learning is applicable are very small. On the other hand, if a state has never appeared, Q-learning cannot deal with it [36]. In other words, Q-learning has neither prediction ability nor generalization ability in this respect.
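
To make this limitation concrete, the sketch below shows a minimal tabular Q-learning update; the environment interface, action set, and hyperparameter values are illustrative assumptions rather than taken from the cited works. The Q-table only holds values for state-action pairs that have actually been visited, so an unseen state carries no informed estimate.

```python
from collections import defaultdict
import random

# Minimal tabular Q-learning sketch (illustrative assumptions only).
alpha, gamma, epsilon = 0.1, 0.99, 0.1
actions = [0, 1, 2, 3]                 # assumed discrete action set
Q = defaultdict(float)                 # Q[(state, action)] -> value; lookup table only

def select_action(state):
    if random.random() < epsilon:
        return random.choice(actions)  # explore
    # Exploit the best known action. States never visited all default to 0,
    # i.e., the table offers no prediction or generalization for them.
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    # Standard Q-learning target: r + gamma * max_a' Q(s', a')
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```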


To give Q-learning prediction ability, and considering that neural networks can extract feature information well, the deep Q-network (DQN) was proposed, which applies a deep neural network to approximate the Q-value function. Specifically, DQN extends the Q-learning algorithm to continuous or large state spaces by replacing the Q-table with a neural network that approximates the Q-value function [37].
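
As a rough illustration of this idea, the following sketch (PyTorch is assumed here; the layer sizes and state/action dimensions are placeholders, not values from the DQN paper) replaces the Q-table with a small network that maps a state vector to one Q value per action, so even previously unseen states receive a prediction.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates Q(s, a) for all actions at once: state in, Q-vector out."""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)

# Greedy action selection: unlike a Q-table, the network produces
# Q values even for states it has never encountered before.
q_net = QNetwork(state_dim=4, num_actions=2)   # assumed dimensions
state = torch.rand(1, 4)                        # placeholder state vector
action = q_net(state).argmax(dim=1).item()
```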


In addition to value-based DRL algorithms such as DQN, we summarize a variety of classical DRL algorithms by algorithm type in Table 1, referring to several DRL-related surveys [38]. The table covers not only policy-based and actor-critic DRL algorithms, but also advanced DRL algorithms for partially observable Markov decision processes (POMDPs) and multi-agent settings.




               4. FEDERATED REINFORCEMENT LEARNING
               In this section, the detailed background and categories of FRL will be discussed.

               4.1. Federated reinforcement learning background
Despite the excellent performance that RL and DRL have achieved in many areas, they still face several important technical and non-technical challenges in solving real-world problems. The successful application of FL to supervised learning tasks has aroused interest in exploiting similar ideas in RL, i.e., FRL. On the other hand, although FL is useful in some specific situations, it fails to deal with cooperative control and optimal decision-making in dynamic environments [10]. FRL not only provides the experience for agents to learn to make good decisions in an unknown environment, but also ensures that the data privately collected during an agent's exploration does not have to be shared with others. A forward-looking and interesting research direction is how to conduct RL under the premise of protecting privacy. Therefore, it has been proposed to use the FL framework to enhance the security of RL and to define FRL as a security-enhanced distributed RL framework that accelerates the learning process, protects agent privacy, and handles non-independent and identically distributed (non-IID) data [8]. Apart from improving the security and privacy of RL, we believe that FRL has broader potential to help RL achieve better performance in various aspects, which will be elaborated in the following subsections.
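
As a loose illustration of this idea (not the specific protocol of any cited work), the sketch below shows one communication round of federated averaging over locally trained Q-network weights: each agent updates its own copy of the model on private experience, and only model parameters, never the raw transitions, are uploaded. The function names and the agent/server interface are assumptions made for illustration.

```python
import copy
import torch

def local_update(global_net, private_experience, train_fn):
    """Agent side: train a private copy; the raw experience never leaves the agent."""
    local_net = copy.deepcopy(global_net)
    train_fn(local_net, private_experience)   # e.g., a few DQN gradient steps
    return local_net.state_dict()

def federated_average(state_dicts):
    """Server side: average parameters across agents (FedAvg-style aggregation)."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

# One round (illustrative):
#   1. every agent runs local_update() on its private trajectories
#   2. the server averages the uploaded weights with federated_average()
#   3. the aggregated model is broadcast back to all agents
# global_net.load_state_dict(federated_average(collected_state_dicts))
```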