

network, i.e., a multilayer perceptron (MLP), and takes its own Q-network output and encryption value as input to calculate a global Q-network output. Based on the output of the global Q-network, the shared value network and the agent's own Q-network are updated. Two agents are used in the FedRL algorithm, which interact with the same environment; however, one of the agents cannot build its own policies and rewards. Finally, FedRL is applied to two different games, i.e., Grid-World and Text2Action, and achieves better results than the other baselines. Although the VFRL model in this paper only contains two agents and the structure of the aggregated neural network model is relatively simple, we believe it is a valuable first attempt to implement VFRL and verify its effectiveness.
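To make the aggregation step concrete, the following sketch (in PyTorch, with hypothetical layer sizes, and Gaussian noise used only as a stand-in for the encrypted message described above) shows how two local Q-networks could feed a shared MLP that produces the global Q-output, with gradients from a single loss updating both the shared network and each agent's own Q-network. It is a minimal illustration of the idea, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LocalQNet(nn.Module):
    """Per-agent Q-network over its own partial observation (hypothetical sizes)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)

class GlobalQNet(nn.Module):
    """Shared MLP that fuses the two agents' Q-outputs into a global Q-value."""
    def __init__(self, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, q_a, q_b):
        return self.net(torch.cat([q_a, q_b], dim=-1))

obs_dim, n_actions = 8, 4
q_alpha, q_beta = LocalQNet(obs_dim, n_actions), LocalQNet(obs_dim, n_actions)
global_q = GlobalQNet(n_actions)
opt = torch.optim.Adam(
    list(q_alpha.parameters()) + list(q_beta.parameters()) + list(global_q.parameters()),
    lr=1e-3,
)

# One toy update step on a batch of transitions.
obs_a, obs_b = torch.randn(32, obs_dim), torch.randn(32, obs_dim)
actions = torch.randint(0, n_actions, (32,))
td_target = torch.randn(32)  # placeholder for r + gamma * max_a' Q_global(s', a')

out_a = q_alpha(obs_a)
out_b = q_beta(obs_b)
# Stand-in for the encrypted value exchanged between agents: the second agent's
# output is perturbed before being shared with the global network.
out_b = out_b + 0.01 * torch.randn_like(out_b)

q_global = global_q(out_a, out_b).gather(1, actions.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(q_global, td_target)

opt.zero_grad()
loss.backward()   # gradients flow to the shared MLP and both local Q-networks
opt.step()
```

In this sketch the agent without its own rewards is trained only through the gradients that flow back from the global Q-output, which matches the update direction described above (global output first, then shared network and local Q-networks).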


Multi-agent RL (MARL) is closely related to VFRL. As the name implies, MARL takes into account the existence of multiple agents in the RL system. However, empirical evaluation shows that directly applying simple single-agent RL algorithms to multi-agent scenarios cannot converge to the optimal solution, since the environment is no longer stationary from the perspective of each agent [66]. Specifically, the action of each agent affects the next state and thus affects all agents in future time steps [67]. Besides, the actions performed by one agent will yield different rewards depending on the actions taken by the other agents. This means that agents in MARL are correlated with each other rather than independent of each other. This challenge, known as the non-stationarity of the environment, is the main problem to be solved in the development of an efficient MARL algorithm [68].
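As a toy illustration of this non-stationarity (a hypothetical sketch, not an example from the cited works), consider two independent Q-learners in a repeated coordination game: each learner treats the other as part of the environment, so the expected return of the same action drifts whenever the other agent's policy changes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-agent, 2-action repeated game: both agents receive +1 when they
# coordinate on the same action, 0 otherwise.
def joint_reward(a1, a2):
    return 1.0 if a1 == a2 else 0.0

# Independent Q-learners: each agent ignores the other's action entirely.
Q1, Q2 = np.zeros(2), np.zeros(2)
alpha, eps = 0.1, 0.2

for t in range(5000):
    a1 = rng.integers(2) if rng.random() < eps else int(np.argmax(Q1))
    a2 = rng.integers(2) if rng.random() < eps else int(np.argmax(Q2))
    r = joint_reward(a1, a2)
    # Each update assumes a fixed environment, but the reward actually depends
    # on the other agent's (changing) policy -- the non-stationarity described above.
    Q1[a1] += alpha * (r - Q1[a1])
    Q2[a2] += alpha * (r - Q2[a2])

print("Agent 1 Q-values:", Q1)
print("Agent 2 Q-values:", Q2)
```

The learners usually coordinate in this easy game, but the target each one chases keeps moving until the other agent's policy settles, which is exactly why naive single-agent methods struggle in harder multi-agent tasks.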


MARL and VFRL both study the problem of multiple agents learning concurrently how to solve a task by interacting with the same environment [69]. Since MARL and VFRL share a large range of similarities, a review of MARL-related works is a useful guide to help researchers summarize the research focus and better understand VFRL. There is abundant literature related to MARL. However, most MARL research [70–73] is based on a fully observed Markov decision process (MDP), where each agent is assumed to have a global observation of the system state [68]. These MARL algorithms are not applicable to the POMDP case, where the observations of individual agents are often only a part of the overall environment [74]. Partial observability is a crucial consideration for the development of algorithms that can be applied to real-world problems [75]. Since VFRL is mainly oriented towards POMDP scenarios, it is more important to analyze the related works of MARL based on POMDP as guidance for VFRL.
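To make the distinction concrete, here is a minimal, hypothetical sketch of the partially observable setting assumed above: each agent extracts only a local slice of the global state, so no agent can act on full state information.

```python
import numpy as np

# Toy global state: positions of two agents and a target on a 1-D grid.
global_state = {"agent_0": 2, "agent_1": 7, "target": 5}

def local_observation(state, agent_id, view_radius=2):
    """Each agent observes only objects within its own view radius,
    so its observation is a strict subset of the global state."""
    me = state[agent_id]
    return {name: pos for name, pos in state.items() if abs(pos - me) <= view_radius}

print(local_observation(global_state, "agent_0"))  # sees only its own position
print(local_observation(global_state, "agent_1"))  # sees itself and the target, not agent_0
```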

Agents in the above scenarios partially observe the system state and make decisions at each step to maximize the overall rewards for all agents, which can be formalized as a decentralized partially observable Markov decision process (Dec-POMDP) [76]. Optimally solving a Dec-POMDP model is well known to be a very challenging problem. In the early works, Omidshafiei et al. [77] propose a two-phase MT-MARL approach that combines cautiously-optimistic learners for action-value approximation with concurrent experience replay trajectories (CERTs), an experience replay mechanism targeting sample-efficient and stable MARL. The authors also apply a recurrent neural network (RNN) to estimate the non-observed state and hysteretic Q-learning to address the problem of non-stationarity in Dec-POMDP. Han et al. [78] design a neural network architecture, IPOMDP-net, which extends the QMDP-net planning algorithm [79] to MARL settings under POMDP. Besides, Mao et al. [80] introduce the concept of information state embedding to compress agents' histories and propose an RNN model combining the state embedding. Their method, i.e., the embed-then-learn pipeline, is universal since the embedding can be fed into any existing partially observable MARL algorithm as a black box. In the study from Mao et al. [81], the proposed Attention MADDPG (ATT-MADDPG) has several critic networks for various agents under POMDP. A centralized critic is adopted to collect the observations and actions of the teammate agents. Specifically, an attention mechanism is applied to enhance the centralized critic. The final introduced work is from Lee et al. [82]. They present a MARL algorithm augmented by pretraining to address the challenge of disaster response. Interestingly, they use behavioral cloning (BC), a supervised learning method where agents learn their policy from demonstration samples, as the approach to pretrain the neural network. BC can generate a feasible Dec-POMDP policy from demonstration samples,