               which offers advantages over plain MARL in terms of solution quality and computation time.


Some MARL algorithms also concentrate on the communication issue of POMDPs. In the study by Sukhbaatar et al. [83], communication between the agents is performed for a number of rounds before their actions are selected, and the communication protocol is learned concurrently with the optimal policy. Foerster et al. [84] propose a deep recurrent network architecture, i.e., the deep distributed recurrent Q-network (DDRQN), to address the communication problem in a multi-agent partially observable setting. This work makes three fundamental modifications to previous algorithms. The first is last-action inputs, which let each agent access its previous action as an input for the next time-step. The second is inter-agent weight sharing, which still allows diverse behavior between agents, since the agents receive different observations and thus evolve different hidden states. The final one is disabling experience replay, because the non-stationarity of the environment renders old experiences obsolete or misleading. Foerster et al. [84] consider the communication task of fully cooperative, partially observable, sequential multi-agent decision-making problems. In their system model, each agent receives a private observation and takes actions that affect the environment; the agent can also communicate with its fellow agents via a discrete limited-bandwidth channel. Despite the partial observability and limited channel capacity, the authors show that the two agents can discover a communication protocol that enables them to coordinate their behavior, based on the approach of deep recurrent Q-networks.
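
To make the three DDRQN modifications concrete, the following is a minimal sketch of a shared-weight recurrent Q-network that takes the last action as an additional input. It is an illustrative assumption, not the exact architecture of Foerster et al. [84]: the class name RecurrentQNet, the layer sizes, and the toy rollout loop are all hypothetical.

```python
# Hedged sketch: a DDRQN-style recurrent Q-network (illustrative only).
import torch
import torch.nn as nn


class RecurrentQNet(nn.Module):
    """Shared-weight recurrent Q-network with last-action input."""

    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        # The observation and the one-hot previous action are concatenated,
        # so the agent conditions on what it did at the last time-step.
        self.encoder = nn.Linear(obs_dim + n_actions, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.q_head = nn.Linear(hidden_dim, n_actions)
        self.n_actions = n_actions

    def forward(self, obs, prev_action, hidden):
        a_onehot = torch.nn.functional.one_hot(prev_action, self.n_actions).float()
        x = torch.relu(self.encoder(torch.cat([obs, a_onehot], dim=-1)))
        hidden = self.rnn(x, hidden)  # hidden state carries the agent's history
        return self.q_head(hidden), hidden


# One network instance is shared by all agents (inter-agent weight sharing);
# agents still behave differently because they feed in different private
# observations and therefore evolve different hidden states. Experience
# replay is disabled: updates would use only freshly collected episodes.
obs_dim, n_actions, n_agents = 8, 4, 2
shared_net = RecurrentQNet(obs_dim, n_actions)
hiddens = [torch.zeros(1, 64) for _ in range(n_agents)]
prev_actions = [torch.zeros(1, dtype=torch.long) for _ in range(n_agents)]

for agent_id in range(n_agents):
    obs = torch.randn(1, obs_dim)  # each agent's private observation
    q_values, hiddens[agent_id] = shared_net(obs, prev_actions[agent_id], hiddens[agent_id])
    prev_actions[agent_id] = q_values.argmax(dim=-1)
```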


While there are some similarities between MARL and VFRL, several important differences should be noted, i.e.,

 • VFRL and some MARL algorithms are able to address similar problems, e.g., the issues of POMDPs. However, the solution ideas of the two algorithms differ. Since VFRL is the product of applying VFL to RL, the FL component of VFRL has, from its inception, focused on the aggregation of the partial features, including states and rewards, observed by different agents. Security is also an essential issue in VFRL. In contrast, MARL arises as the most natural way of adding more than one agent to an RL system [85]. In MARL, agents not only interact with the environment but also have complex interactive relationships with other agents, which creates a great obstacle to policy optimization. Therefore, the original intentions of the two algorithms are different.
 • The two algorithms also differ slightly in structure. The agents in MARL always receive rewards, even though some of them may not have their own local actions. However, in some cases, the agents in VFRL are not able to generate a corresponding operation policy, so some agents have neither actions nor rewards [65]. Therefore, VFRL can solve a wider range of problems that MARL is not capable of solving.
 • Both algorithms involve the communication problem between agents. In MARL, information such as the states of other agents and model parameters can be directly and freely propagated among agents. During communication, some MARL methods, such as DDRQN in the work of Foerster et al. [84], consider the previous action as an input for the next time-step, and weight sharing is also allowed between agents. However, VFRL assumes that states cannot be shared among agents. Since these agents do not exchange experience and data directly, VFRL focuses more on the security and privacy issues of communication between agents, as well as on how to process the mid-products transferred by other agents and aggregate the federated models; a sketch of this kind of exchange is given after this list.
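
The sketch below illustrates the VFRL-style exchange described in the last bullet: agents never transmit raw observations, only intermediate embeddings ("mid-products"), which a federated model then aggregates. The module names (LocalEncoder, FederatedQHead), the embedding size, and the concatenation step are assumptions made for illustration, not a specification from the surveyed papers.

```python
# Hedged sketch: VFRL-style mid-product exchange (illustrative only).
import torch
import torch.nn as nn


class LocalEncoder(nn.Module):
    """Each agent keeps its raw observation private and shares only an embedding."""

    def __init__(self, obs_dim: int, embed_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, embed_dim))

    def forward(self, obs):
        return self.net(obs)


class FederatedQHead(nn.Module):
    """Aggregates the mid-products from all agents and outputs Q-values."""

    def __init__(self, embed_dim: int, n_agents: int, n_actions: int):
        super().__init__()
        self.q = nn.Linear(embed_dim * n_agents, n_actions)

    def forward(self, embeddings):
        return self.q(torch.cat(embeddings, dim=-1))


# Contrast with MARL: there, raw states or model parameters may be shared
# directly; here only low-dimensional embeddings cross the agent boundary.
obs_dims, n_actions = [6, 10], 4
encoders = [LocalEncoder(d) for d in obs_dims]
head = FederatedQHead(16, len(obs_dims), n_actions)

local_obs = [torch.randn(1, d) for d in obs_dims]              # private to each agent
mid_products = [enc(o) for enc, o in zip(encoders, local_obs)]  # only these are transmitted
q_values = head(mid_products)
```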

In summary, as a promising and notable algorithm, VFRL has several advantages as follows, i.e.,
 • Excellent privacy protection. VFRL inherits the FL algorithm's idea of data privacy protection, so for the task of multiple agents cooperating in the same environment, information interaction can be carried out confidently to enhance the learning efficiency of the RL model. In this process, no participant has to worry about any leakage of its raw real-time data.
 • Wide application scenarios. With appropriate knowledge extraction methods, including algorithm design and system modeling, VFRL can solve more real-world problems than MARL algorithms. This