network, i.e., a multi-layer perceptron (MLP), and takes its own Q-network output and the encrypted value as input to calculate a global Q-network output. Based on the output of the global Q-network, the shared value network and the agent's own Q-network are updated. Two agents are used in the FedRL algorithm, and both interact with the same environment; however, one of the agents cannot build its own policies and rewards. Finally, FedRL is applied to two different games, i.e., Grid-World and Text2Action, and achieves better results than the other baselines. Although the VFRL model in this paper contains only two agents and the structure of the aggregated neural network model is relatively simple, we believe it is a valuable first attempt at implementing VFRL and verifying its effectiveness.
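To make the aggregation step concrete, the following is a minimal sketch of such a two-agent vertical setup, assuming PyTorch; the layer sizes, the placeholder encrypt function, and all variable names are illustrative assumptions rather than the original FedRL implementation.

```python
import torch
import torch.nn as nn

class LocalQNet(nn.Module):
    """Per-agent Q-network over the agent's own partial observation."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)

class SharedValueNet(nn.Module):
    """Shared MLP that aggregates both agents' local Q-outputs into a global Q-output."""
    def __init__(self, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, own_q, peer_q_encrypted):
        return self.net(torch.cat([own_q, peer_q_encrypted], dim=-1))

def encrypt(q_values):
    # Placeholder for the privacy-preserving transformation applied to the
    # peer agent's Q-output before it is shared (illustrative only).
    return q_values + 0.01 * torch.randn_like(q_values)

# One TD update driven by the global Q-output: gradients flow into both the
# shared value network and the agent's own local Q-network (the peer output is detached).
obs_dim, n_actions, gamma = 8, 4, 0.99
q_a, q_b, shared = LocalQNet(obs_dim, n_actions), LocalQNet(obs_dim, n_actions), SharedValueNet(n_actions)
opt = torch.optim.Adam(list(q_a.parameters()) + list(shared.parameters()), lr=1e-3)

obs_a, obs_b = torch.randn(1, obs_dim), torch.randn(1, obs_dim)
next_obs_a, next_obs_b = torch.randn(1, obs_dim), torch.randn(1, obs_dim)
action, reward = torch.tensor([1]), torch.tensor([0.5])

global_q = shared(q_a(obs_a), encrypt(q_b(obs_b).detach()))
with torch.no_grad():
    target = reward + gamma * shared(q_a(next_obs_a), encrypt(q_b(next_obs_b))).max(dim=-1).values
loss = nn.functional.mse_loss(global_q.gather(1, action.view(1, 1)).squeeze(), target.squeeze())
opt.zero_grad()
loss.backward()
opt.step()
```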
Multi-agent RL (MARL) is very closely related to VFRL. As the name implies, MARL takes into account the
existence of multiple agents in the RL system. However, empirical evaluation shows that directly applying simple single-agent RL algorithms to multi-agent scenarios cannot converge to the optimal solution, since the environment is no longer stationary from the perspective of each agent [66]. Specifically, the action of each agent affects the next state, and thus affects all agents in future time steps [67]. Besides, the actions performed by a certain agent will yield different rewards depending on the actions taken by the other agents. This means that agents in MARL are correlated with each other, rather than being independent of each other. This challenge, known as the non-stationarity of the environment, is the main problem to be solved in the development of an efficient MARL algorithm [68].
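As a toy illustration of this non-stationarity (the payoff matrix below is invented purely for illustration), consider a two-agent matrix game: while agent 2 keeps updating its policy, the expected return of agent 1's unchanged actions drifts, so the environment appears non-stationary from agent 1's single-agent perspective.

```python
import numpy as np

# Toy 2x2 cooperative matrix game: rows = agent 1's action, cols = agent 2's action.
payoff = np.array([[4.0, 0.0],
                   [1.0, 3.0]])

def expected_return_for_agent1(pi2):
    """Expected payoff of each of agent 1's actions given agent 2's mixed policy pi2."""
    return payoff @ pi2

# While agent 2 is learning, its policy shifts over time ...
for t, pi2 in enumerate([np.array([0.9, 0.1]), np.array([0.5, 0.5]), np.array([0.1, 0.9])]):
    q1 = expected_return_for_agent1(pi2)
    print(f"step {t}: agent 1's action values = {q1}, greedy action = {q1.argmax()}")
# ... so the value (and even the greedy choice) of agent 1's fixed actions changes even
# though agent 1 itself has changed nothing: the environment is non-stationary from its
# single-agent perspective.
```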
MARL and VFRL both study the problem of multiple agents concurrently learning how to solve a task by interacting with the same environment [69]. Since MARL and VFRL share a large range of similarities, a review of MARL's related works is a very useful guide that helps researchers summarize the research focus and
better understand VFRL. There is abundant literature related to MARL. However, most MARL research [70–73]
is based on a fully observed Markov decision process (MDP), where each agent is assumed to have the global
observation of the system state [68] . These MARL algorithms are not applicable to the case of POMDP where
the observations of individual agents are often only a part of the overall environment [74] . Partial observability
is a crucial consideration for the development of algorithms that can be applied to real-world problems [75] .
Since VFRL is mainly oriented towards POMDP scenarios, it is especially important to analyze the related works of MARL based on POMDP as guidance for VFRL.
Agents in the above scenarios partially observe the system state and make decisions at each step to maximize the overall rewards for all agents, which can be formalized as a decentralized partially observable Markov decision process (Dec-POMDP) [76].
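For reference, a Dec-POMDP is commonly formalized as the tuple below; the notation follows the standard formulation rather than the specific symbols of [76].

```latex
\left\langle \mathcal{I},\; \mathcal{S},\; \{\mathcal{A}_i\}_{i \in \mathcal{I}},\; T,\; R,\; \{\Omega_i\}_{i \in \mathcal{I}},\; O,\; \gamma \right\rangle
```

Here I is the set of agents, S the state space, A_i the action set of agent i, T(s' | s, a) the transition function over joint actions a = (a_1, ..., a_n), R(s, a) the shared reward, Ω_i the observation set of agent i, O(o | s', a) the observation function, and γ the discount factor; each agent must act based only on its own local observation history.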
Optimally solving a Dec-POMDP model is well known to be a very challenging problem. In the early works, Omidshafiei et al. [77] proposes a two-phase MT-MARL approach that combines cautiously optimistic learners for action-value approximation with concurrent experience replay trajectories (CERTs), an experience replay mechanism targeting sample-efficient and stable MARL.
The authors also apply a recurrent neural network (RNN) to estimate the non-observed state and hysteretic Q-learning to address the problem of non-stationarity in Dec-POMDP.
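Hysteretic Q-learning itself is a small modification of the tabular Q-update: two learning rates are used, and the smaller one is applied when the temporal-difference error is negative, so that an agent does not over-penalize actions that failed only because of teammates' exploration. Below is a minimal sketch with illustrative variable names.

```python
import numpy as np

def hysteretic_q_update(Q, s, a, r, s_next, alpha=0.1, beta=0.01, gamma=0.99):
    """One hysteretic Q-learning update for a single agent's tabular Q-values.

    alpha is used for positive TD errors, the smaller beta for negative ones,
    keeping the learner optimistic about occasional low returns caused by
    other agents' exploratory actions.
    """
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    lr = alpha if td_error >= 0 else beta
    Q[s, a] += lr * td_error
    return Q

# Usage on a toy 5-state, 3-action table:
Q = np.zeros((5, 3))
Q = hysteretic_q_update(Q, s=0, a=1, r=1.0, s_next=2)
```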
Han et al. [78] designs a neural network architecture, IPOMDP-net, which extends the QMDP-net planning algorithm [79] to MARL settings under
POMDP. Besides, Mao et al. [80] introduces the concept of information state embedding to compress agents' histories and proposes an RNN model that incorporates the state embedding. Their method, i.e., the embed-then-learn pipeline, is universal since the embedding can be fed into any existing partially observable MARL algorithm as a black box.
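As a rough illustration of the embed-then-learn idea, the sketch below uses a GRU to compress an agent's observation-action history into a fixed-size embedding that a downstream (black-box) MARL learner could consume; the network sizes and names are assumptions, not the architecture of [80].

```python
import torch
import torch.nn as nn

class HistoryEmbedder(nn.Module):
    """Compress an agent's observation-action history into a fixed-size
    information-state embedding with a recurrent network."""
    def __init__(self, obs_dim, act_dim, embed_dim=32):
        super().__init__()
        self.gru = nn.GRU(input_size=obs_dim + act_dim, hidden_size=embed_dim, batch_first=True)

    def forward(self, obs_seq, act_seq):
        # obs_seq: (batch, T, obs_dim); act_seq: (batch, T, act_dim), one-hot previous actions
        x = torch.cat([obs_seq, act_seq], dim=-1)
        _, h = self.gru(x)            # h: (1, batch, embed_dim)
        return h.squeeze(0)           # embedding summarizing the history

# The embedding then replaces the raw history as the "state" fed to any
# existing partially observable MARL algorithm.
embedder = HistoryEmbedder(obs_dim=10, act_dim=4)
obs_hist = torch.randn(8, 5, 10)      # batch of 8 agents' 5-step observation histories
act_hist = torch.zeros(8, 5, 4)       # one-hot previous actions
state_embedding = embedder(obs_hist, act_hist)   # shape (8, 32)
```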
In the study from Mao et al. [81], the proposed Attention MADDPG (ATT-MADDPG) has several critic networks for the various agents under POMDP. A centralized critic is adopted to collect the observations and actions of the teammate agents, and an attention mechanism is applied to enhance this centralized critic.
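One common way to realize such an attention-enhanced centralized critic is sketched below, using generic scaled dot-product attention over teammates' observation-action embeddings; it follows the general idea rather than the exact ATT-MADDPG architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCritic(nn.Module):
    """Centralized critic: the ego agent attends over teammates'
    (observation, action) pairs and scores its own action in that context."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.ego_enc = nn.Linear(obs_dim + act_dim, hidden)
        self.mate_enc = nn.Linear(obs_dim + act_dim, hidden)
        self.q_head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, ego_obs_act, mate_obs_act):
        # ego_obs_act: (batch, obs_dim + act_dim); mate_obs_act: (batch, n_mates, obs_dim + act_dim)
        query = self.ego_enc(ego_obs_act)                    # (batch, hidden)
        keys = self.mate_enc(mate_obs_act)                   # (batch, n_mates, hidden)
        scores = torch.einsum('bh,bmh->bm', query, keys) / keys.shape[-1] ** 0.5
        weights = F.softmax(scores, dim=-1)                  # attention over teammates
        context = torch.einsum('bm,bmh->bh', weights, keys)  # weighted teammate summary
        return self.q_head(torch.cat([query, context], dim=-1))  # Q-value for the ego action

critic = AttentionCritic(obs_dim=10, act_dim=4)
q_value = critic(torch.randn(2, 14), torch.randn(2, 3, 14))  # batch of 2, 3 teammates each
```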
The final work we introduce is from Lee et al. [82]. They present a MARL algorithm augmented by pretraining to address the challenges of disaster response. Interestingly, they use behavioral cloning (BC),
a supervised learning method where agents learn their policy from demonstration samples, as the approach
to pretrain the neural network. BC can generate a feasible Dec-POMDP policy from demonstration samples,